1.  Introduction

One of the earliest applications of computers was the processing of visual data. With the benefit of
hindsight, we can see that this reflects the importance of sight for humans, the difficulties faced by those lacking
sight, and the continuing drive in computer science to automate human abilities.

There is currently a surge of interest in image understanding on the part of industry and the military.
Interest seems certain to expand over the next several decades, as the following list of current applications
indicates:

•  AUTOMATION  OF  INDUSTRIAL  PROCESSES. 

Object acquisition by robot arms, for example by "bin picking".

Automatic  guidance  of  seam  welders  and  cutting  tools. 

VLSI-related processes, such as lead bonding, chip alignment and packaging.

Monitoring, filtering, and thereby containing the flood of data from oil drill sites or from seismographs.

Providing visual feedback for automatic assembly and repair.

•  INSPECTION TASKS

The inspection of printed circuit boards for spurs, shorts, and bad connections.

Checking  the  results  of  casting  processes  for  impurities  and  fractures. 

Screening medical images such as chromosome slides, cancer smears, x-ray and ultrasound images, and
tomography.

Routine  screening  of  plant  samples. 

•  REMOTE  SENSING 

Cartography,  the  automatic  generation  of  hill  shaded  maps,  and  the  registration  of  satellite  images  with 
terrain  maps. 

Monitoring  traffic  along  roads,  docks,  and  at  airfields. 

Management  of  land  resources  such  as  water,  forestry,  soil  erosion,  and  crop  growth. 

Exploration of remote or hostile regions for fossil fuels and mineral ore deposits.

•  MAKING COMPUTER POWER MORE ACCESSIBLE.

Management  information  systems  that  have  a  communication  channel  considerably  wider  than  current 
systems  that  are  addressed  by  typing  or  pointing. 

Document  readers  (for  those  that  still  use  paper). 

Design  aids  for  architects  and  mechanical  engineers. 

•  MILITARY  APPLICATIONS. 

Tracking  moving  objects. 

Automatic  navigation  based  on  passive  sensing. 

Target  acquisition  and  range  finding. 

•  AIDS FOR THE PARTIALLY SIGHTED.

Systems  that  read  a  document  and  say  what  was  read. 

Automatic  "guide  dog"  navigation  systems. 

Over the past decade there has been considerable growth in the theoretical base of image understanding
(IU) by computer. This article surveys the current state of that theoretical base. As the intellectual climate
for progress in IU improved, so funding became available for much needed basic research. Most of
the work described in this survey was conducted under the Defense Advanced Research Projects Agency's
(DARPA) image understanding program at a small number of basic research centers: Carnegie Mellon
University, the University of Maryland, Massachusetts Institute of Technology, the University of Rochester,
SRI International, Stanford University, the University of Southern California, and Virginia Polytechnic Institute
and State University. The DARPA IU program has also produced a number of innovative applications-oriented
techniques. For reasons of space, these and other applications are omitted from the present discussion.

There is a considerable diversity of approaches to processing visual images by computer. As a result,
the boundary between different thrusts is often vague, necessarily so. The characteristic feature of IU is the
construction of rich descriptions from an image, an idea that is made more precise in the following pages. Of
the many disciplines closely related to IU, four are of particular interest to the computer science community:
image processing, computer graphics, computer aided design and manufacture, and pattern recognition. Image
processing is primarily concerned with the transmission, storage, enhancement, and restoration of images.
There are significant overlaps between IU and image processing, especially in the early processing operations
of edge detection and region finding. William K. Pratt's book [PRAT78] is an excellent introduction to the
subject. Computer graphics is concerned primarily with the display of visual information. Considerable
attention has been given to representing points, edges, surfaces, and volumes to facilitate display. The geometry
of perspective and parallel (or orthographic) projection has been studied in detail. Newman and Sproull's
book [NEWM73] is a fine introduction. Computer aided design and manufacture (CAD/CAM) also gives
attention to surface representations in order to define paths for numerically controlled tools and to make
design by traditional techniques such as "lofting" amenable to mathematical analysis. The book by Faux
and Pratt [FAUX79] introduces the mathematics of CAD/CAM. Although these three disciplines are closely
related to IU, sometimes developing similar representations and uncovering similar constraints, they differ
from IU in that they are not concerned with the interpretation or understanding of images.

Pattern recognition is much more closely related to IU. Good introductions are available, including Duda
and Hart [DUDA73] and Pavlidis [PAVL78]. The significant differences between IU and pattern recognition
are the following:

•  Pattern recognition systems are typically concerned with recognizing the input as one of a (usually)
small set of possibilities. IU aims to construct rich descriptions that cannot be enumerated in advance but
need to be constructed for each individual image. Three-dimensional scenes, viewed from an arbitrary
location, give rise to a wide variety of occlusion (overlap) relationships. One can hope to compute descriptions of
three-dimensional layout but not to recognize it as an instance of one of a small number of stored prototypes.

•  Pattern recognition systems are mostly concerned with two-dimensional images, such as leaf samples
or fingerprints. When the images are of three-dimensional objects, such as engine parts, they are effectively
treated as two-dimensional, by treating each stable position as a separate object. IU has dealt extensively with
three-dimensional images.

•  Most significantly, pattern recognition systems typically operate directly on the image. IU approaches
to stereo, texture, shape from shading, indeed most visual processes, operate not on the image but on symbolic
representations that have been computed by earlier processing such as edge detection.

Before  we  begin  the  survey  proper,  we  note  some  common  themes  that  have  crystallized  over  the  past 
decade. 

•  Attention  has  shifted  from  restrictions  on  the  domain  of  application  of  a  vision  system  to  restrictions  on 
visual  abilities. 

The most fundamental differences between image understanding as it is now, and as it was a decade
ago, stem from the current concentration on topics corresponding to identifiable modules in the human visual
system. Substantial progress has been made in, for example, binocular stereo, the extraction of important
intensity changes from an image, the interpretation of surface contours, the determination of surface orientation
from texture, the computation of motion, and the representation of three-dimensional objects. The focus of
current research is defined more narrowly in terms of visual abilities than by restricting attention from the start
to a domain of application. The depth of analysis is correspondingly greater. Increasingly, the progression is
from general theoretical developments to specific practical applications. The alternative approach of inferring
general principles from work in a limited practical domain is still present, but less so than formerly.

What identifies a particular operation as a distinguishable module in the visual system? Some of the most
solid evidence for the claims of individual modules is offered by psychophysical demonstrations of human
visual abilities. Care is taken, as far as possible, to isolate a particular source of information and show that
the perceptual ability in question survives. One particularly intriguing source of evidence for modules in
the human visual system comes from the study of patients with disabilities resulting from brain lesions (for
example Weiskrantz, Warrington, Sanders and Marshall [WEIS74], Marshall and Newcombe [MARS73],
Stevens [STEV76]). Many psychophysical experiments, seemingly isolating particular modules of the human
visual system, have been reported in the literature. Notable examples include Gibson's demonstration of the
perception of surface shape from texture gradients [GIBS50], Land's demonstration of the computation of
lightness [LAND71, HORN74], and Julesz's demonstration of stereoscopic fusion without monocular cues
[JULE71]. In some cases there is clear evidence of a human perceptual ability, although such evidence would
hardly be referred to as psychophysical. Horn's work at MIT considers the highly developed human ability
to infer shape from shading [HORN77, WOOD81, IKEU81]. Stevens considers the three-dimensional
interpretation of surface contours by humans [STEV81]. On the other hand, it is equally clear that we do not
have a specific module in our visual system to recognize "yellow Volkswagens" (see for example [WEIS73]).
It is less clear whether we compute depth directly, as opposed to indirectly through integrating over surface
orientations, or what use we make of directional selectivity or optical flow.

The change of focus from a narrowly specified domain of application to a particular module of the human
visual system has had a number of far-reaching consequences for the way IU research is conducted. One
consequence has been a sharp decline in the construction of entire vision systems that mobilize knowledge at
all levels, including information specific to some domain of application. In order to complete the construction
of such systems, it is almost inevitable that corners be cut and many overly simplified assumptions be made.

•  Representations  have  been  developed  that  make  explicit  the  information  computed  by  a  module. 

A number of representations are discussed in this survey, including the primal sketch, the reflectance
map, intrinsic images, normalized texture property maps, and object representations based on generalized
cones. A simple observation, which nevertheless has profound consequences, is that not all modules work
directly on the image. Indeed, it seems that few do. Instead, they operate on representations of the
information computed, or made explicit, by other processes. In the case of stereo, Marr and Poggio argue against
correlating the intensity information in the left and right images [MARR79b]. Instead, they suggest that edge
feature points are matched (see Section 4.1). Baker and Binford, Arnold, and Mayhew and Frisby argue that
matching should in fact take place on a different representation, called the primal sketch [BAKE81, ARNO78,
MAYH81].

Combining this observation with the previous point about modules of the visual system leads to a view
of visual perception as the process of constructing instances of a sequence of representations. To each module
there corresponds a representation on which it operates, and a representation that it produces. The first of
these representations, and the one whose structure is least subject to dispute, is the image itself. Not
surprisingly, most attention has centered on those modules that operate upon the image (section 3). As we shall see,
the further we progress up the processing hierarchy, the less secure the story becomes, as the exact structure
of the representations becomes more subject to dispute. This is hardly surprising. The image aside, any
representation is one module's input and another's output. Computer science teaches us that all of them shape
its eventual structure.

For example, several modules of the visual system provide information about the layout of visible
surfaces. Stereo provides disparity, from which local shape and relative depth can be computed. Motion, texture,
and shading all provide evidence for shape. Barrow and Tenenbaum have suggested that a number of different
viewer-centered representations make explicit important information associated with surfaces [BARR78]. They
call such representations intrinsic images and propose specific intrinsic images for depth, motion, surface
topography, and color. The name intrinsic images stems from Barrow and Tenenbaum's idea that the
representations are addressed using the same coordinates as the image. For example, the color at an image point
whose coordinates are p might be found in representation C as C(p). Others, notably Marr and Horn, have
suggested a single representation that makes explicit local surface orientation and discontinuities of depth
[MARR78a, HORN82]. The precise details are uncertain at the time of writing.
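
The addressing idea behind intrinsic images can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the channel list, the helper functions `make_intrinsic_images` and `lookup`, and the stored values are hypothetical and are not taken from [BARR78]. The only point carried over from the text is that every representation shares the image's coordinates, so C(p) is a plain lookup at point p.

```python
# Hypothetical sketch: intrinsic images as grids registered with the input image.
# Channel names and helper names are illustrative assumptions, not from [BARR78].
CHANNELS = ("depth", "motion", "surface_topography", "color")


def make_intrinsic_images(height, width):
    """Return one grid per channel, each the same size as the image."""
    return {name: [[None] * width for _ in range(height)] for name in CHANNELS}


def lookup(images, channel, p):
    """Read the value of one representation at image point p = (row, col)."""
    row, col = p
    return images[channel][row][col]


images = make_intrinsic_images(4, 6)
p = (1, 2)
images["color"][p[0]][p[1]] = (200, 30, 30)  # made-up colour value
images["depth"][p[0]][p[1]] = 3.5            # made-up depth, arbitrary units

color_at_p = lookup(images, "color", p)      # plays the role of C(p)
depth_at_p = lookup(images, "depth", p)
```

Because all grids share one coordinate system, information about the same image point can be read from any representation without a separate registration step.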

•  The mathematics of image understanding are becoming more sophisticated.

Mathematical analyses have been offered for some of the elements of visual perception, such as the
relationship between image irradiance and scene radiance, the location of important intensity changes, and
motion primitives. In each case, it is observed that the information in the image only partially constrains
the interpretation of the image, and further constraints are sought. The additional constraints embody
commitments about the way the world is, at least most of the time. For example, the world mostly consists of smooth
surfaces, and scenes are mostly viewed from a position free of accidental alignments. Perceptual abilities such
as stereopsis, lightness determination, and shape from shading and from texture, require that the appropriate
constraints be uncovered and appropriately expressed.

Most of the analyses to be discussed below begin with a precise description of the representations
operated on and produced by the visual process under scrutiny. Increasingly, "precise" means "mathematically
precise", as the technical content of image understanding has become steadily more sophisticated. Many
observations about the world, as well as our assumptions about it, are naturally articulated in terms of the
"smoothness" of some appropriate quantity. This intuitive idea is made mathematically precise in a number of
ways in real analysis, for example in conditions for differentiability. Relationships between smoothly varying
quantities give rise to differential equations, such as Horn's image irradiance equation. We shall discover the
value of making the image forming process explicit. This in turn leads to a concern with geometry, such as
the properties of the gradient, stereographic, and dual spaces. Combining geometry and smoothness leads
naturally to multivariate vector analysis, and to differential geometry. For the most part, a representation
does not of itself contain sufficient information to guarantee that a module can uniquely arrive at the result
computed so effortlessly by the human visual system. Additional assumptions, in the form of constraints, are
required. This observation has led to application of constraint satisfaction and equation solving techniques
from numerical analysis as well as various instantiations of Lagrange multipliers (especially in the form of the
calculus of variations).

•  Locally parallel architectures have been developed.

The majority of the work to be described here had its initial expression in the form of complex computer
programs. A common complaint about artificial intelligence in general, and image understanding in particular,
used to be that it not only did not run in real time, but inherently could not. To the extent that this referred to
so-called "heterarchical" programs of 1970s vintage, this was justified. However, artificial intelligence has
been well advised not to make real-time performance its most important metric of success, since such a metric
often implicitly assumes a particular, usually sequential, model of computation.

Many recent vision algorithms take the form of parallel computations involving local interactions. Once
the ideas are fully fixed in software, they are naturally realized in hardware. Davis and Rosenfeld review one
popular class of program structures, called "relaxation" [DAVI81]. In the case of edge finding, one algorithm
has been implemented in TTL logic [NISH81], and several others in CCD [NUDD79]. The current rapid pace
of developments in VLSI has further motivated research into local parallel programming architectures. It is
likely that our concept of computation will change as a result of such developments. Vision will be one of the
first areas to benefit from such advances. It seems that it will also be a continuing source of inspiration to VLSI
designers [BATA81, NUDD79]. As more sophisticated ideas are embodied in hardware, new applications of
image understanding will become feasible.

•  There  are  growing  links  between  image  understanding  and  theories  of  human  vision. 

For  many  authors,  the  changing  style  of  research  in  image  understanding  has  not  been  simply  a  matter 
of  a  narrowing  of  attention  and  a  more  highly  developed  technical  content.  Instead,  greater  significance  is 
attached  to  forging  explicit  links  between  IU  and  psychophysics  and  neurophysiology.  From  this  perspective, 
image  understanding  aims  at  the  construction  of  computational  theories  of  human  visual  perception.  In 
large  part,  this  approach  stems  from  a  series  of  papers  written  by  David  Marr  and  his  colleagues  at  MIT. 
Marr's  work  derives  from  a  background  in  neurophysiology,  and  is  expressly  addressed  to  psychophysicists 
and neurophysiologists, among whom it has excited considerable interest. In particular, it is couched in
terms they are accustomed to, and makes extensive reference to their literature, rather than that of computer
vision. A book describing Marr's thoughts about human visual perception and incorporating summaries of
the contributions he and his colleagues have made across the entire range of the subject is currently in press
[MARR82].

It might be imagined that there would be considerable differences of emphasis, subject matter, and
technical content between the work of those researchers who see themselves constructing a computational theory
of human visual perception and those for whom human visual perception is at most a matter of secondary
concern. This turns out not to be the case. For example, the ACRONYM system's representation of objects based
upon generalized cones bears many similarities to that proposed by Marr and Nishihara, who relate their work
to human perception [BROO79, MARR78]. Again, Horn and Schunck's work on the determination of optical
flow has intriguing similarities to the directional selectivity work of Marr and Ullman that was inspired by
neurophysiology [HORN81c, MARR81].

Figure 1 shows some of the representations and modules to be discussed in the remainder of the paper.
The figure is intended to make the organization of the paper easier to understand, but it should be treated with
caution. The organization implicit in the figure is similar to that given in Barrow and Tenenbaum [BARR81b]
and Marr [MARR78]. The representation referred to here as the "surface orientation map" is intended to
cover what Marr calls the "2½-D sketch" [MARR78a], Horn calls the "needle map" [HORN82], and Barrow
and Tenenbaum call "intrinsic images" [BARR78].

The paper, and hence the figure, is limited in scope. As mentioned above, there is little discussion of
applications. There is little if anything about color, and only cursory discussions of motion. The extraction of
useful information from color is still extremely rudimentary. Motion has received some attention recently, but
findings are preliminary. For example, it is far too early to know what information can be computed reliably
from the changing patterns of brightness called the optical flow (see section 3.2). A pervasive view of motion
perception is that it arises from temporal changes to the representations that are important for static vision.
The Marr-Hildreth theory of edge detection inspired Marr and Ullman's work on directional selectivity, the
primal sketch led to Ullman's work on long range motion, and Horn's work on shape from shading underlies
the work of Horn and Schunck on the determination of optical flow.

Judged as a flow diagram, figure 1 suggests that the flow of information, and the construction of
representations, is entirely sequential, proceeding from the lowest level operations on the image to more semantic
higher level operations. Many authors have argued that perceptual processing cannot be so rigidly sequential.
They suggest that perception is opportunistic, taking advantage of whatever information becomes available in
an image. Natural scenes are normally highly redundant. Gibson [GIBS50] notes approximately 23 distinct
cues for determining depth and surface layout, many of which are available in most images. However, if only
an unpredictable small selection of cues is available, vision is not usually impaired. Only when a single cue is
present, as in the laboratory settings of experimental psychology, is our perceptual system easy to fool. Minsky
and Papert [MINS72] suggested that the flexible processing of information by the perceptual system might
best be modelled by process interactions. This produced a rash of programs in which relatively high level
knowledge could actively intervene to modify the course of low level processing. Examples include [SHIR73,
BAJC75, BAJC76b, TENE77, BRAD78, HANS77, BROO79, SELF81]. Similar "heterarchical" programs
were experimented with in speech perception [LESS77]. The performance of such programs did not give cause
for unbridled celebration. Some of the associated difficulties are reviewed in [BRAD79].

Figure 1. Some of the representations and modules discussed in the paper.

A rather different kind of flexibility is made available by local parallelism. [WALT72] showed how a
variety of cues could be combined to yield an overall interpretation. [DAVI81] stress that an attribute of such
process structures is their insensitivity to the sequence in which operations are performed. However, local
parallel processes have their own problems. It is easy enough to start local parallel processes going. It is less
easy to guarantee that they will stop (but see [HUMM80]), or to be able to make solid assertions about the final
state of computation when they do stop. It may be that process structuring will become a key component of
image understanding, but currently it is simply too early to be sure. For the moment it seems best to remain
agnostic and concentrate on the solid achievements of the past decade, most of which are largely independent
of process structuring.

Organization  of  the  paper 

In the next section we present a brief review of work in geometrically simple "microworlds". Some
of the generally important ideas developed initially for the blocks world of line drawings of polyhedra are
introduced. Kanade's extension to the world of origami, and Barrow and Tenenbaum's work on curved "play
dough" figures, are mentioned.

Section  3,  by  far  the  longest  in  the  paper,  discusses  modules  that  operate  directly  upon  the  image. 
Subsection  3.1  concerns  edge  finding,  3.2  the  determination  of  shape  from  shading,  3.3  texture,  and  3.4 
segmentation. 

Section 4 discusses modules that operate on the output of section 3, which, following [MARR76a], we
call the primal sketch. Subsection 4.1 discusses stereo, 4.2 shape from contour, 4.3 shape from texture and
Kender's generalization to "shape from you name it". Finally, subsection 4.4 briefly discusses shape from
motion.

Sections 5 and 6 discuss modules that operate on surface orientations and viewpoint-independent
representations.


2.  Review  of  work  on  geometrically  simple  microworlds 

Beginning with the seminal work of [ROBE62], much early attention of IU was devoted to interpreting
line drawings of polyhedra automatically. This work marked a significant break from pattern recognition in
that it emphasized descriptions of the objects present in a scene and the spatial relationships between them.
For example, figure 2 might be described as a cube standing in front of a block. Clowes and Huffman stressed
that the relationship between a scene and its image needs to be made explicit [CLOW71, HUFF71]. A line is
the image of the edge of a polyhedron in the scene. They noted that lines can be labelled as convex, concave,
or occluding (figure 3a). The interpretation of a line cannot change along its length. A junction is the image
of a three-dimensional vertex. Enumeration of the local volumes occupied by vertices, and the appearance
of such vertices from all possible viewpoints, gives rise to a set of labellings for junctions (figure 3b). Vertex
labellings embody a local constraint: although there are three lines forming an arrow junction, and each line
has four possible interpretations (counting the two senses of occlusion separately), there are not $4^3 = 64$
physically realizable labellings for an arrow vertex but only 3. Notice that every interpretation of a T-junction
is assumed to signal an occlusion of the stem. Conversely, every scene occlusion gives rise to a T-junction. The
constraints local to each junction propagate along the lines that connect them to adjacent junctions, possibly
rendering some of the initial set of labellings at both junctions impossible. Clowes determined consistent
interpretations by a search space technique. Surprisingly, many simple line drawings have many consistent
interpretations, though occlusion often resolves ambiguity.
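
The propagation of junction constraints along shared lines can be sketched in a few lines of code. This is a minimal illustration in the style of Waltz filtering, not the search procedure Clowes used. It assumes each line label is expressed relative to a fixed direction along the line, so that both end junctions refer to the same label. The junction names, line names, and candidate labellings in `toy_candidates` are hypothetical and are not the real Huffman-Clowes junction tables.

```python
from itertools import product

# Line labels: convex, concave, and the two senses of occlusion.
LABELS = ("+", "-", ">", "<")

# Without junction constraints a three-line junction has 4**3 labellings.
unconstrained = list(product(LABELS, repeat=3))


def filter_labellings(junction_lines, candidates):
    """Discard junction labellings that no neighbouring junction supports.

    junction_lines: dict mapping junction -> tuple of incident line names
    candidates:     dict mapping junction -> set of label tuples, one label
                    per incident line, in the same order as junction_lines
    Returns a new dict of surviving labellings; inputs are not modified.
    """
    live = {j: set(labs) for j, labs in candidates.items()}
    changed = True
    while changed:  # candidate sets only shrink, so this terminates
        changed = False
        for j, lines in junction_lines.items():
            for k, other in junction_lines.items():
                shared = [ln for ln in lines if ln in other]
                if j == k or not shared:
                    continue
                # Labels that junction k still allows on the shared lines.
                allowed = {
                    tuple(lab[other.index(ln)] for ln in shared)
                    for lab in live[k]
                }
                keep = {
                    lab for lab in live[j]
                    if tuple(lab[lines.index(ln)] for ln in shared) in allowed
                }
                if keep != live[j]:
                    live[j] = keep
                    changed = True
    return live


# HYPOTHETICAL toy catalogue: junctions A and B share the single line "c".
toy_lines = {"A": ("a", "b", "c"), "B": ("c", "d", "e")}
toy_candidates = {
    "A": {("+", "+", "+"), ("-", "-", "-"), (">", "+", "<")},
    "B": {("+", "-", "-"), (">", ">", "+")},
}
result = filter_labellings(toy_lines, toy_candidates)
```

In this toy case junction B only permits "+" or ">" on line c, which eliminates two of A's candidates. A's surviving label on c then eliminates one of B's, so a constraint at one junction removes possibilities at both ends of the line, as described above.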

Despite the geometric restrictions imposed by Huffman and Clowes, their scheme had limited
competence. First, as Kanade pointed out, the Huffman-Clowes scheme was essentially qualitative in that it could
not distinguish between the truncated pyramid shown in figure 4a and the cube shown in figure 4b [KANA81].
Human perception is at least partly quantitative since we readily assign slopes to line drawn surfaces and
estimate rectangularity of vertices from junctions. Since the line drawing in figure 4b can be the image of an
infinite set of scenes, it is more precise to say that the Huffman-Clowes scheme could not determine that figure
4a has no interpretation for which vertex A is rectangular while figure 4b does. It is also interesting to ask why
the cube is perceived as a cube. One proposal, due to Kanade, is sketched below.

Figure 2. A typical line drawing of polyhedra studied by Huffman and Clowes.

A second manifestation of the qualitative nature of the Huffman-Clowes scheme is its inability to detect
the impossibility of the line drawing shown in figure 5. Huffman's paper was principally concerned with
"impossible objects" (such as that depicted in figure 5), and the consequent need for a more expressive
representation. He proposed a representation called dual space and an orthographic projection of it called the dual
picture graph. Mackworth [MACK73] developed the idea of a representation of surface shape further by
introducing gradient space, an idea that was developed in [DRAP80, DRAP81, HORN77, KANA80, KANA81,
KEND80, HUFF77, SUGI78, SUGI81].




Figure  3.  a.  The  possible  interpretations  of  an  image  line.  b.  The  possible  interpretations  of  a 
trihedral  vertex. 


Figure 4. The Huffman-Clowes scheme could not distinguish these line drawings.

Figure 5. The Huffman-Clowes scheme could not determine that this line drawing depicts an
"impossible object".

Consider the imaging geometry depicted in figure 6: a surface $f(x, y) - z = 0$ is viewed from a great
distance along the negative z-axis. Applying the chain rule,

$$\frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy - dz = 0,$$

that is

$$\left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, -1\right) \cdot (dx, dy, dz) = 0,$$

so that $(\partial f/\partial x, \partial f/\partial y, -1)$ are the direction ratios of the surface normal or gradient.
It is customary to denote $\partial f/\partial x$ by $p$ and $\partial f/\partial y$ by $q$. The coordinate frame
based on $(p, q)$ is called gradient space. As an example consider a planar facet $ax + by + c - z = 0$. The
gradient has $p = a$, $q = b$. The origin of gradient space corresponds to surface facets that point directly at
the viewer. Moving away from the origin, it is easy to show that $\tan^{-1}\sqrt{p^2 + q^2}$ is the slant of the
surface normal. The angle $\tau$ whose tangent is $q/p$ is the tilt of the surface normal (figure 7).

The coordinates can be aligned so that a vector $(x, y, z) = \mathbf{v}$ projects to
$(x, y) = \mathbf{k} \times (\mathbf{v} \times \mathbf{k})$, where $\mathbf{k}$ is the unit vector in the z direction.
In particular, the gradient vector $(p, q, -1)$ projects to $(p, q)$. Suppose two planes $P_1$ and $P_2$ have
surface normals $(p_i, q_i, -1)$, and suppose that they meet in a space vector $\mathbf{v}$. It is easy to show that
the image $l$ of $\mathbf{v}$ is perpendicular to the dual line connecting $g_1 = (p_1, q_1)$ to $g_2 = (p_2, q_2)$
[MACK73]. Furthermore, $\mathbf{v}$ is convex if and only if the order of the $g_i$ across $l$ is the same as the
order of the images of $P_i$ across $l$ (figure 8). Mackworth exploited this observation in a program that was
capable of determining the impossibility of the notched tetrahedron shown in figure 5. However, Mackworth's
triangulation solution scheme could not determine the impossibility of the notched cube also shown in figure 5
[MACK73]. Draper [DRAP81] has analyzed the competence of Mackworth's gradient space scheme and an extension due
to Huffman based on "dual space" [HUFF77].
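The gradient-space relations described above can be checked numerically. The following is a minimal Python sketch, not code from any of the cited programs; the function names (`slant_and_tilt`, `edge_image_and_dual_line`) and the sample gradients are illustrative assumptions. It computes slant and tilt from a gradient (p, q), and confirms for one pair of planes that the image of their intersection edge is perpendicular to the dual line joining their gradients.

```python
import math

def slant_and_tilt(p, q):
    """Slant and tilt (radians) of a surface whose gradient is (p, q)."""
    slant = math.atan(math.hypot(p, q))  # tan(slant) = sqrt(p^2 + q^2)
    tilt = math.atan2(q, p)              # tan(tilt) = q / p
    return slant, tilt

def cross(u, v):
    """Vector product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def edge_image_and_dual_line(g1, g2):
    """Image direction of the edge between two planes, and their dual line.

    g1 and g2 are the gradients (p, q) of the two planes, so the surface
    normals are (p, q, -1). The edge lies in both planes, hence along the
    vector product of the normals.
    """
    n1 = (g1[0], g1[1], -1.0)
    n2 = (g2[0], g2[1], -1.0)
    v = cross(n1, n2)
    image = (v[0], v[1])                       # orthographic projection along z
    dual = (g2[0] - g1[0], g2[1] - g1[1])      # line joining g1 to g2
    return image, dual

image, dual = edge_image_and_dual_line((0.5, 2.0), (-1.0, 0.25))
dot = image[0] * dual[0] + image[1] * dual[1]  # vanishes when perpendicular
```

The perpendicularity holds for any pair of gradients, since the projected edge is (q2 - q1, p1 - p2) and the dual line is (p2 - p1, q2 - q1).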

Figure 6. Viewing geometry for defining gradient space.

The notched cube of figure 5 illustrates an assumption discussed by Kanade [KANA81], namely that lines that
are parallel in the image are the images of vectors that are parallel in space. If lines $l_1$ and $l_2$ are the
images of scene vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, then it is easy to show that $l_1$ is parallel to $l_2$
if and only if the triple scalar product $[\mathbf{v}_1, \mathbf{v}_2, \mathbf{k}]$ is zero. It follows that Kanade's
parallel line assumption fails only when $\mathbf{v}_1$, $\mathbf{v}_2$, and $\mathbf{k}$ are coplanar.
Generally, people find it difficult to interpret such foreshortened figures properly [MARR78b, MARR78a].
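The triple scalar product test can be illustrated with a short Python sketch. The helper names and the tolerance are assumptions made for the example. With k = (0, 0, 1), the product reduces to the two-dimensional cross product of the image vectors, which is why it vanishes exactly when the images are parallel.

```python
K = (0.0, 0.0, 1.0)  # unit vector along the viewing (z) direction

def triple_scalar_product(u, v, w):
    """[u, v, w] = u . (v x w), the determinant with rows u, v, w."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

def images_parallel(v1, v2, tol=1e-9):
    """True when the orthographic images of v1 and v2 are parallel."""
    return abs(triple_scalar_product(v1, v2, K)) < tol

# Two scene vectors that are not parallel in space but are coplanar with k:
# their images coincide in direction, so the parallel line assumption
# would misread them as parallel in space.
misleading = images_parallel((1.0, 2.0, 0.0), (2.0, 4.0, 7.0))
```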

Figure 7. Slant and tilt in gradient space.

Kanade [KANA81] has also studied an interesting assumption involving what he calls "skew-symmetry".
Consider figures 9a, 9b and 9c. All three are interpreted as symmetric, planar figures viewed obliquely. As
figure 9d shows, a skew symmetry defines two directions: the image of the axis of symmetry, called the skewed
symmetry axis, and the image of the normal to the axis of symmetry that lies in the plane of the figure, called
the skewed transverse axis. Skew symmetries feature prominently on the cube and truncated pyramid shown
in figure 4. Kanade proposes that a skewed symmetry is always interpreted as the image of a real symmetry
viewed obliquely. This assumption gives rise to a constraint, expressed in terms of the angles $\alpha$ and $\beta$
defined in figure 9d, relating the possible gradients of the surface containing the real symmetry. In fact, the
possible gradients form the hyperbola shown in figure 10. Notice that the possible planes with least slant (the tips
of the hyperbola) have a normal that projects into the bisector of the skewed symmetry axis and the skewed
transverse axis. This accords with a heuristic finding of Stevens [STEV80].
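One way to see where the hyperbola comes from is to back-project the two skewed axes onto a candidate plane and require them to be perpendicular in space. The Python sketch below works under stated assumptions: orthographic projection, a plane z = px + qy + c, and image directions at angles alpha and beta to the x axis. The function names are hypothetical, and the residual is a derivation from the definitions in the text rather than a formula quoted from [KANA81].

```python
import math

def back_project(theta, p, q):
    """Space vector on a plane of gradient (p, q) whose image has direction theta.

    On the plane z = p*x + q*y + c, a step (cos theta, sin theta) in the image
    corresponds to a change of p*cos(theta) + q*sin(theta) in z.
    """
    c, s = math.cos(theta), math.sin(theta)
    return (c, s, p * c + q * s)

def skew_symmetry_residual(alpha, beta, p, q):
    """Dot product of the two back-projected axes.

    The gradient (p, q) is consistent with a real symmetry when this is zero:
    cos(alpha - beta) + (p cos a + q sin a)(p cos b + q sin b) = 0,
    which is a hyperbola in (p, q) when the axes are not already perpendicular.
    """
    u = back_project(alpha, p, q)
    v = back_project(beta, p, q)
    return sum(ui * vi for ui, vi in zip(u, v))

# Axes 60 degrees apart in the image; this gradient lies on the hyperbola.
on_curve = skew_symmetry_residual(0.0, math.pi / 3, 1.0, -2.0 / math.sqrt(3.0))
# A frontal plane (p = q = 0) cannot make these axes perpendicular.
frontal = skew_symmetry_residual(0.0, math.pi / 3, 0.0, 0.0)
```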

It is important to realize that the parallelism and skew-symmetry assumptions apply beyond the blocks
world. Kanade has shown how they can be combined with Huffman-Clowes style labelling and Mackworth-style
algebraic analysis to give both a quantitative and a qualitative interpretation of line drawings in the
microworlds of blocks and origami constructions [KANA81].

Figure 8. Convexity preserves order across the gradient line.

Figure 9. Skewed symmetry. a-c: examples of skew symmetry. d: definition of skewed-symmetry
axis and skewed transverse axis. (Reproduced from [KANA81], figure 16)

The junction labelling constraints of Huffman and Clowes are essentially local. The constraints of surface
planarity, skew symmetry, and parallelism are less local and support more competent programs. However,
none of the constraints are global in the sense that they apply simultaneously to all parts of the image. Waltz
investigated the global constraint afforded by the shadows cast by a single distant light source [WALT72].
The number of interpretations of a line rose from 4 to 12, with a consequent massive number of possible
junction labellings. As Draper has pointed out, the large (and probably unverified) labelling sets would be
considerably larger without the assumption of general position of the viewer [DRAP80]. Waltz's line labels
incorporate information about the surface geometry, illumination, and surface-object boundaries. The huge
label sets precluded a tree search of the sort used by Clowes [CLOW71]. Instead, Waltz designed a filter
program, potentially capable of running as a local parallel program, that usually converged to a single labelling
in near linear time. The Waltz filter accelerated investigation of local parallelism. Line labelling is discussed
by [ZUCK77, ZUCK81, HUMM80]. Waltz's program reaffirmed the value of redundancy when processing
can make appropriate use of it. However, the complex line labellings confounded too much information from
different levels of the visual system in an impoverished representation.
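The flavour of a filtering approach can be conveyed by a toy Python sketch. It is not Waltz's program or his label catalogue: the junction names, the labels "+" and "-", the data layout, and the rule that a line must carry the same label at both ends are simplifying assumptions for illustration. Each junction holds a set of candidate labellings, and candidates are discarded whenever the label they assign to a line is not offered by the junction at the other end.

```python
def waltz_filter(candidates, lines):
    """Discard junction labellings that no neighbouring junction can match.

    candidates: dict mapping junction name -> iterable of labellings, where a
                labelling is a tuple with one label per incident line.
    lines:      list of ((junction_a, slot_a), (junction_b, slot_b)) pairs,
                giving the position of the shared line in each labelling.
    """
    cands = {j: set(labs) for j, labs in candidates.items()}
    changed = True
    while changed:                      # repeat until a fixed point is reached
        changed = False
        for (ja, sa), (jb, sb) in lines:
            labels_a = {lab[sa] for lab in cands[ja]}
            labels_b = {lab[sb] for lab in cands[jb]}
            common = labels_a & labels_b
            for j, s in ((ja, sa), (jb, sb)):
                kept = {lab for lab in cands[j] if lab[s] in common}
                if kept != cands[j]:
                    cands[j] = kept
                    changed = True
    return cands

# A chain of three junctions A - B - C. Fixing A prunes B, which prunes C.
toy_candidates = {
    "A": [("+",)],
    "B": [("+", "-"), ("-", "+")],
    "C": [("-",), ("+",)],
}
toy_lines = [(("A", 0), ("B", 0)), (("B", 1), ("C", 0))]
filtered = waltz_filter(toy_candidates, toy_lines)
```

Each pass only compares neighbouring junctions, which is what makes the procedure suitable in principle for local parallel execution.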

Figure 10. A skewed symmetry defined by the angles $\alpha$ and $\beta$ can be the projection of a real
symmetry on a plane whose gradient is $(p, q)$ if and only if the gradient lies on the hyperbola
shown. (Reproduced from [KANA81], figure 17)

The figures discussed in this section have all been images of objects with planar surfaces. Some authors
have tried to relax this restriction. One difficulty with drawings of curved surfaces is that one of the basic
assumptions of the Huffman-Clowes work no longer holds: a line can change its interpretation from one end
to the other [HUFF71]. Turner [TURN74] noted that such changes of interpretation are not arbitrary, and
he allowed a small number of transformations of a line label to arrive at an interpretation. Recently, Binford
[BINF81] and Lowe and Binford [LOWE81] have suggested more general interpretations of curved lines that
may enable labelling techniques to be extended to line drawings of arbitrarily curved surfaces (see also section
3.1.3).


Barrow and Tenenbaum [BARR78] have also studied a microworld of curved objects. They combine line
labelling techniques with Horn's work on shape from shading (see section 3.2) to interpret idealized images of
"play dough" scenes.

Work in geometrically simple microworlds has played an important role in the development of image
understanding. From the pioneering work of Roberts, Clowes, and Huffman to the present day, the goal has
been to generate descriptions rather than transformed or classified images. The key has been to make the
relationships between the scene and the image explicit. Examples include the interpretations of image lines as
visible edges, and the analyses of skew symmetry and parallelism. Mackworth's development of gradient space
points up the need for rich representations. Finally, Waltz's work shows that redundancy can be exploited by
appropriate computing mechanisms.

Microworlds also set traps. It is irresistibly tempting to deploy domain specific information at the earliest
opportunity. Planar objects have a number of global properties that are not enjoyed by curved objects. For
example, two planes intersect along a single straight edge in space, so that from any given viewpoint, one
plane is always in front of the other on one side of the image of the edge, and always behind it on the other
[DRAP81]. The labelling schemes of Huffman, Clowes, and Waltz, extended to idealised images of curved
objects with reflectance patches and shadows, produce a vast number of labels that confound many distinct
sources of information in a single label. It seems more fruitful to attempt to tease out the information provided
by each of these sources separately.

3.  Modules  that  operate  on  the  image 

3.1  Edge  detection 

A great deal of effort has been devoted to understanding how the significant intensity changes in an
image can be extracted, and how the resultant information can best be represented. Marr coined the term
primal sketch to describe such a representation [MARR76a]. Significant intensity changes correspond to a
variety of events in a scene, such as depth, reflectance, and shadow boundaries, as well as discontinuities in
surface orientation. The image intensities I(x, y) form a surface that is a discrete approximation to one that is
continuous nearly everywhere [ROSE76, PRAT79]. Quantization and sensor noise of various sorts complicate
the formulation of a predicate that can completely reliably determine which intensity changes correspond to
perceptible scene events (that is, which are "significant").

It has been observed repeatedly over the past twenty years that intensity changes correspond to maxima
of the gradient of the image surface, or equivalently to places at which the second derivative crosses zero and
changes sign. Many local operators have been developed to approximate first and second directional
derivatives by first and second differences. A representative sample is shown in figure 11. Mostly, such operators
were developed and tuned for a limited domain of application.

Figure 12 shows an idealized step change in intensity and the response of first and second difference
operators. In practice, gradient operators tend to produce a large response over a broad region flanking an
edge (see figure 14, also [BINF81]), especially with intensity changes other than steps. As a result, feature
points from a gradient operator have to be thinned, a process that makes it difficult to localize the position
of the edge as accurately as with second difference operators. On the other hand, errors grow rapidly as
differences are taken, so that second differences are much noisier than first differences.
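The behaviour of first and second differences on an ideal step can be reproduced with a few lines of Python. This is a one-dimensional sketch with hypothetical helper names, not one of the operators of figure 11. On the noise point: if each sample carries independent noise of variance s, the first difference (weights 1, -1) has noise variance 2s and the second difference (weights 1, -2, 1) has variance 6s, which is one way to quantify why second differences are noisier.

```python
def first_difference(signal):
    """Forward first difference: d[i] = signal[i + 1] - signal[i]."""
    return [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]

def second_difference(signal):
    """Central second difference at each interior sample."""
    return [signal[i + 1] - 2 * signal[i] + signal[i - 1]
            for i in range(1, len(signal) - 1)]

def sign_changes(values):
    """Indices i such that values[i] and values[i + 1] have opposite signs."""
    return [i for i in range(len(values) - 1) if values[i] * values[i + 1] < 0]

step = [0, 0, 0, 10, 10, 10]
d1 = first_difference(step)    # single peak at the step
d2 = second_difference(step)   # positive lobe then negative lobe
crossings = sign_changes(d2)   # the zero crossing lies between the lobes
```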

A recent edge finder, which appears to work well on a range of natural images, is due to Nevatia and
Babu [NEVA78]. It applies the six gradient operators shown in figure 13 to each point of an image and
chooses the one giving the best response if (1) it is high enough and (2) it is not dominated by the responses
at neighboring points in a direction which is normal to the same apparent edge. This process is followed by
thinning, thresholding, and line fitting. Some indication of the performance of the Nevatia-Babu algorithm
can be seen in figure 14.

Figure 11. A selection of masks from the image understanding literature used to compute
approximations to the first derivative of an image in the x direction.

Figure 12. The response of an edge and bar operator to an ideal step change in intensity. a. The
intensity change. b. The response of a typical first difference edge operator such as that shown
in figure 11a. c. The response of a typical bar operator such as that shown in figure 11e.

Binford has argued that it is important to distinguish between the detection of an intensity change and
its subsequent localization [BINF81]. He suggests that a maximum of a noisy signal is good for detecting
change but not for isolation. Conversely, a zero crossing is ideal for localizing change but not for detection.
MacVicar-Whelan and Binford find adjacent pixels between which a second differencing-like operator changes
sign [MACV81]. Using linear interpolation they claim to be able to localize intensity changes with sub-pixel
accuracy. Sub-pixel accuracy is also claimed by [MARR79] in the context of vernier acuity, where the eye is
able to perceive breaks in lines that are more closely spaced than the physiology of the eye would seem to
permit [MARR79].
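Linear interpolation between two samples of opposite sign is simple to state in code. The following Python sketch shows the general idea only; it is not the operator of [MACV81], and the function name is an assumption. If the operator output is a at pixel i and b at pixel i + 1, with a and b of opposite sign, the straight line through the two values crosses zero at i + a / (a - b).

```python
def subpixel_zero_crossings(values):
    """Estimate zero-crossing positions by linear interpolation.

    Returns fractional indices at which a straight line joining adjacent
    samples of opposite sign passes through zero.
    """
    crossings = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a * b < 0:                       # strict sign change between i and i+1
            crossings.append(i + a / (a - b))
    return crossings

symmetric = subpixel_zero_crossings([0, 10, -10, 0])   # crossing midway
skewed = subpixel_zero_crossings([1.0, -3.0])          # crossing nearer sample 0
```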

Real images are further complicated by defocussing and the frequent occurrence of slow intensity
gradients across large portions of the image. Humans are largely unaware of slow linear intensity gradients
[LAND71, MCCA74]. This seems to be because of "lateral inhibition", where the image is processed by
"center surround" operators (figure 15) that resemble rotationally symmetric second differential operators.

Figure 13. The masks used by [NEVA78] to compute the derivatives of an image at 30 degree
intervals.

Figure 14. Results of running the Nevatia and Babu operator over a natural image.

Figure 15. A center surround operator.

Herskovits and Binford [HERS70] proposed an early taxonomy for the intensity changes they found in
images of polyhedra, classifying them as "step", "roof", or "edge" changes (figure 16). As we shall elaborate
below, they proposed different operators $F_{step}$, $F_{roof}$, and $F_{edge}$ to detect each different type of
intensity change. It is commonly supposed, especially in applications where scenes are effectively flat, that the
majority of intensity changes are of the simple step type. Many detection schemes are predicated upon this
assumption. Herskovits and Binford [HERS70] and Horn [HORN77] observe that step edges typically correspond
to depth or reflectance boundaries, whereas the equally important class of intensity changes corresponding to
surface orientation discontinuities often give rise to roof and edge transitions. Marr refined the Herskovits and
Binford classification to include "extended edge", and "thin and wide bar" (figure 17) and proposed a variety of
operators of different sizes to discriminate between them [MARR76a].

The construction of a primal sketch representation from an image has three distinguishable stages: (1)
"feature points" are detected at which the intensity change is deemed to be significant; (2) feature points
are grouped to form line segments, or small closed contours; (3) these line segments are interpreted as scene
events, say as bounding contours or as true edges of visible surfaces. These three stages are discussed in turn in
the following subsections.

Figure 16. The taxonomy of intensity profiles proposed by Herskovits and Binford. a. idealization
b. examples.

Figure 17. Marr's classification of the intensity changes that occur in natural images. After figure
2 of [MARR76a]

The operators shown in figure 11 are directionally selective. Some authors have proposed the use of
rotationally symmetric operators, such as the Laplacian $\Delta$, for edge detection [BRAD81b]. Several reasons have
been advanced. Some authors prefer theoretical arguments, noting the (near) isotropy of human vision and
the fact that the center surround operators giving lateral inhibition are rotationally symmetric. Others have
stressed practical considerations. For example, in her discussion of the Marr-Hildreth theory of edge detection
(to be discussed in section 3.1.1), Hildreth [HILD80, page 13] notes that "a number of practical considerations,
which will be illuminated in the discussion of the implementation, suggested that the . . . operators not be
directional". Suppose instead that directional operators are used. Most algorithms for finding feature points
have two stages: first, the image is convolved with directional operators in "sufficiently many" directions, and
second, the outputs are combined to determine the orientation and extent of intensity changes. Regarding
the first stage, both Marr and Hildreth [MARR80a, page 193] and Hildreth [HILD80, page 40] comment
on the cost of convolving with a "sufficient" number of operators. They show that a single rotationally
symmetric operator (the Laplacian) gives precisely the same results if a condition called "linear variation" holds.
Regarding the second stage, Hildreth [HILD80, page 36] observes that edges in a direction close to that of
the mask are elongated ("smeared") in the direction of the mask. She also notes that operators at several
orientations give significant responses to any given edge, and that combining the responses is non-trivial.
Other authors are less convinced of the need for rotationally symmetric operators for edge finding [BINF81].


The issue of control arises in edge finding as it does in all other areas of image understanding. It has
been argued that it is not possible to find significant intensity changes, group them, or interpret them without
engaging quite high level knowledge. Bajcsy and Tavakoli [BAJC75, BAJC76b] were early proponents of this
view, as was Shirai [SHIR73]. Davis and Rosenfeld survey the application of relaxation processing to isolate
feature points [DAVI81].

3.1.1 Finding feature points

Although many of the published schemes for detecting and isolating feature points were discovered
empirically, there have been three main approaches to making edge finding more precise. The first consists
of locally modelling the image by a parameterized analytic surface and determining the best fitting choice
of parameters given the actual intensity distribution. The second is Binford's application of signal theory to
edge finding. Finally, Marr [MARR76a] and Marr and Hildreth [MARR80] have developed a theory of edge
finding in the human visual system that takes account of neurophysiology and psychophysics. We discuss each
of these approaches in turn.

Surface  fitting 

The derivation of operators to approximate first and second differences by least squares surface fitting
was introduced by Prewitt [PREW70] and Hueckel [HUEC71]. [BROO78, HUMM79, HARA80] give good
introductions to the method. In the simplest case, where noise considerations are ignored, two things must be
chosen: (1) the size of the local neighborhood or window in which the surface will be fit, and (2) the function
to approximate the image surface in the window. For simplicity, we choose a window of size 2 by 2 and
approximate the image surface in such a window by a plane P(x, y) = ax + by + c. Haralick [HARA80] calls
this the "sloped facet" model. Assuming that the response of an edge operator is independent of the choice of
coordinate origin, we assume that the window covers x = 0, 1; y = 0, 1 (figure 18). We determine the best
fitting choice of parameters a, b and c by least squares minimization of the difference between the intensity
values actually found in the window and those predicted by the function P(x, y). The square of this difference
is given by


$$e^2 = (a + b + c - I(1,1))^2 + (a + c - I(1,0))^2 + (b + c - I(0,1))^2 + (c - I(0,0))^2.$$

For a least squares fit, we first set

$$\frac{\partial e^2}{\partial a} = 0.$$

This implies

$$2a + b + 2c = I(1,1) + I(1,0).$$

Similarly, setting $\partial e^2/\partial b$ and $\partial e^2/\partial c$ equal to zero, we get

$$a + 2b + 2c = I(1,1) + I(0,1),$$

and

$$2a + 2b + 4c = I(0,0) + I(1,0) + I(0,1) + I(1,1).$$

Solving, we see that

$$2a = I(1,1) + I(1,0) - I(0,1) - I(0,0),$$

and

$$2b = I(1,1) + I(0,1) - I(1,0) - I(0,0).$$

The gradient of P(x, y) in the x-direction is $\partial P/\partial x = a$. Similarly, $\partial P/\partial y = b$.
We can depict the gradient operators a and b as in figure 18.
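The closed-form solution above is easy to check numerically. The Python sketch below is illustrative: the function name and argument order are assumptions, with `i10` standing for I(1, 0) and so on. It evaluates a, b, and c for a 2 by 2 window and recovers the coefficients of an exactly planar intensity patch.

```python
def sloped_facet_2x2(i00, i10, i01, i11):
    """Least squares plane P(x, y) = a*x + b*y + c over the window x, y in {0, 1}.

    ixy denotes the intensity I(x, y). The expressions for a and b are the
    solutions derived from the normal equations; c then follows from
    2a + 2b + 4c = I(0,0) + I(1,0) + I(0,1) + I(1,1).
    """
    a = (i11 + i10 - i01 - i00) / 2.0
    b = (i11 + i01 - i10 - i00) / 2.0
    c = (i00 + i10 + i01 + i11 - 2.0 * a - 2.0 * b) / 4.0
    return a, b, c

# Intensities sampled from the plane I = 3x + 5y + 7.
a, b, c = sloped_facet_2x2(i00=7.0, i10=10.0, i01=12.0, i11=15.0)
```

For data that are not exactly planar, the same three numbers give the plane with the smallest squared error over the four pixels.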

Haralick has extended the basic scheme illustrated above to model the effect of sensor noise [HARA80].
He adds a normally distributed noise term $\eta(x, y)$ to the function P(x, y) and shows that an F-test is
appropriate for deciding whether or not there is a significant change in the slope of adjacent sloped facets. Here
"significant" is given its usual 1% statistical meaning.

Figure 18. a. The 2 by 2 window covering pixels (0,0) to (1,1). b and c. The gradient operators
that result from best fitting a plane ("sloped facet") in the window shown in a.

Brooks [BROO78] considers fitting planes and quadratics to 3 by 3 windows. The best fit plane gives the
Prewitt operator shown in figure 11, and the second derivative of the best fit quadratic gives the bar mask
shown in figure 11. Brooks observes that the dot product of the gradient operators a and b in figure 18 is
zero. This suggests that it may be possible to develop an orthogonal set of increasingly higher order masks.
One natural choice for such an orthogonal set is the set of Fourier basis functions. Other choices are Walsh or
Hadamard functions. The best fitting choice of Fourier basis functions was developed by Hueckel in an early
application of the function fitting idea [HUEC71]. O'Gorman proposed the use of best fitting Walsh functions
[OGOR76].
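The 3 by 3 plane fit can be sketched in the same way. The Python below is a minimal illustration, assuming the window is centred so that x and y run over -1, 0, 1; the names are hypothetical and it is not taken from [BROO78]. With centred coordinates the normal equations decouple, so a is the sum of x times intensity divided by 6, which is the Prewitt weighting up to scale. The sketch also checks that the x and y masks have zero dot product.

```python
COORDS = (-1, 0, 1)

def fit_plane_3x3(window):
    """Least squares plane a*x + b*y + c over a centred 3 by 3 window.

    window[y + 1][x + 1] holds the intensity at (x, y) for x, y in {-1, 0, 1}.
    Because the sums of x, y and x*y over the window vanish, each coefficient
    is a simple weighted sum of the intensities.
    """
    a = sum(x * window[y + 1][x + 1] for y in COORDS for x in COORDS) / 6.0
    b = sum(y * window[y + 1][x + 1] for y in COORDS for x in COORDS) / 6.0
    c = sum(window[y + 1][x + 1] for y in COORDS for x in COORDS) / 9.0
    return a, b, c

def mask_dot(m1, m2):
    """Dot product of two masks of the same shape."""
    return sum(u * v for r1, r2 in zip(m1, m2) for u, v in zip(r1, r2))

# Weights applied to the window when computing 6a and 6b.
MASK_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
MASK_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# Intensities sampled from the plane I = 2x - 3y + 4.
plane = [[2 * x - 3 * y + 4 for x in COORDS] for y in COORDS]
coefficients = fit_plane_3x3(plane)
```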

Binford's signal theory approach

Recently, Binford [BINF81] has outlined an approach to edge finding that has its roots in two early
unpublished papers [HERS70, HORN73]. The details are not completely clear and would be a valuable addition
to the literature. It was noted above that image noise makes it difficult to determine reliably which intensity
changes are significant. Herskovits and Binford showed how to estimate the signal to noise ratio for an image,
and determined that the error is typically about 1% for a zero signal. They studied intensity profiles in scenes
of polyhedra and proposed the classification shown in figure 16. The response of a bar mask to an ideal step
edge is shown in figure 19 (see also [MARR76a]). Clearly, as the number of points in the bar mask increases,
the operator can detect steps of lesser heights more reliably. Herskovits and Binford make this idea more
precise by defining the sensitivity of an operator as the signal for which detection is 50% successful.

The intensity values determined by sensors are most reliable in the middle range. Accordingly, Herskovits
and Binford [HERS70, page 36] suggest upper and lower thresholds u and l on intensity. The ideal step gives
rise to a band of u's flanked by a band of l's. Define L to be the number of points in the left band at which the
thresholded intensity is u minus the number of points at which it is l. Similarly, R is the number
of points in the right band at which the thresholded value is u minus the number at which the value is l. If
$F_{step} = L - R$ is big enough, a local maximum is found. In this way the step is detected though not localized.

Figure 19 also shows the response of a bar mask to an ideal roof intensity change. Note that unlike step
changes, the response reaches a maximum in the vicinity of the top of the roof. Accordingly an operator $F_{roof}$
is defined as the sum R + L, that is, the difference between the numbers of u's and l's summed
over both bands.
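The counting scheme can be made concrete with a small Python sketch. It follows one reading of the description above and is not the operator of [HERS70]: the three-level thresholding, the band contents, and the function names are assumptions for illustration. Each sample scores +1 if it reaches the upper threshold, -1 if it falls to the lower threshold, and 0 otherwise, so a band's score is its count of u's minus its count of l's.

```python
def threshold(value, lower, upper):
    """Map an intensity to +1 (a 'u'), -1 (an 'l'), or 0 (middle range)."""
    if value >= upper:
        return 1
    if value <= lower:
        return -1
    return 0

def band_score(band, lower, upper):
    """Number of u's minus number of l's in a band of samples."""
    return sum(threshold(v, lower, upper) for v in band)

def f_step(left, right, lower, upper):
    """L - R: large in magnitude when the bands lie on opposite sides."""
    return band_score(left, lower, upper) - band_score(right, lower, upper)

def f_roof(left, right, lower, upper):
    """R + L: the excess of u's over l's summed over both bands."""
    return band_score(right, lower, upper) + band_score(left, lower, upper)

# An ideal step: a band of high values next to a band of low values.
step_response = f_step([9, 9, 9], [1, 1, 1], lower=2, upper=8)
roof_response = f_roof([9, 9, 9], [1, 1, 1], lower=2, upper=8)
```

Under these assumptions the step gives the largest possible value of L - R for bands of three samples, while R + L cancels to zero.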

A refinement of the scheme is described in [BINF81]. The operator $F_{step}$ approximates the derivative
of the second derivative, or equivalently, detects the step intensity change by looking at the third derivative
of intensity. The intensity change is then localized from the zero crossing of the second derivative. A roof
change is detected from the maximum of the second derivative and localized from the zero crossing of the
third derivative.

The operators $F_{step}$, $F_{roof}$, and a similar one for "edge effects" were incorporated in the Binford-Horn
line finder [HORN73] and discussed retrospectively in [BINF81].

Marr's approach to edge detection by the human visual system


Figure 19. Response of a bar mask to an ideal step (a) and roof edge (b). 1. The intensity
change. 2. Response to a lateral inhibition operator. 3. Derivative of 2.


A novel feature of Marr's development of the primal sketch [MARR76a] was its direct reference to
neurophysiology and psychophysics, a commitment Marr continued to stress in later work. Marr's algorithm
for computing the primal sketch from an image had a number of interesting features. First, being inspired
by neurophysiology, Marr applied the findings of Hubel, Wiesel, Barlow, and others, which seem to suggest
that an early stage in the processing of visual information consists of convolving the image with edge and
bar masks. As we observed above, such masks signal an approximation to the first and second (directional)
derivatives of the intensity function. Marr based his algorithm on an analysis of the response of bar and edge
masks to ideal instances of the scene events that give rise to intensity changes. The algorithm itself consisted
of convolving an image with a number of edge and bar masks and then "parsing" the results by comparing the
actual responses to those predicted for ideal scene events. It was noted that bar masks seemed to give more
reliable information than edge masks, an observation whose explanation awaited the later development of
$\Delta G$ operators, which have a similar cross section (see below). The algorithm convolved the image with masks
of different panel widths. Although the later justification for this would be in terms of separate processing
channels, the original explanation was based on the need for noise reduction, although this idea was never
formulated precisely. In any case, the outputs of the individual channels were combined, not only to reduce
the effects of noise, but to compute measures such as the "fuzziness" of an edge. The idea of combining
the outputs of independent channels remains an important goal of the work on zero crossings, but, with the
singular exception of stereo (see below), it has not yet been worked out.

Marr and Hildreth [MARR80, page 189] point out that "a major difficulty with natural images is that
changes can and do occur over a wide range of scales, so it follows that one should seek a way of dealing with
the changes occurring at different scales." One way to do this, which has been proposed several times in the
image processing literature, is to pass the image through a number of band limited filters. The difficult issues
raised by the idea concern the choice of filters (bar mask, Fourier, Gaussian), the number of them, and the
exact band pass characteristics of each.

Intensity changes are localized in space, a fact which derives from their physical causes [HORN77,
MARR76, MARR80a]. Marr and Hildreth argue that they are also localized in the frequency domain. Marr
and Hildreth [MARR80, page 191] note that "unfortunately, these two localization requirements, the one in
the spatial and the other in the frequency domain, are conflicting". The Fourier transform of a bar mask has
components of arbitrarily high frequency. Similarly, the inverse transform of a barlike band pass filter in
the Fourier domain has significant "echoes"; [HILD80] gives examples. They point out that a Gaussian filter
optimizes localization in both domains simultaneously, and so it is chosen as the band limiting filter in their
theory.

For the practical considerations given in the introduction to this section, Marr and Hildreth propose the
use of a rotationally symmetric operator to find feature points. An obvious candidate is the Laplacian $\Delta$ (see
[BRAD81] for a discussion of rotationally symmetric operators). The Marr and Hildreth approach to edge
finding follows Gaussian smoothing by convolving the image with a Laplacian, thus isolating the positions of
zero crossings. In fact, by the convolution theorem [BRAC65, page 118],

$$\Delta (G * \mathrm{image}) = (\Delta G) * \mathrm{image},$$

where $G$ is a Gaussian operator, and $*$ denotes convolution. Marr and Hildreth [MARR80, page 193] point
out that the $\Delta G$ operator closely resembles the difference of Gaussian (DOG) operators proposed by Wilson
and Giese [WILS77] (see also [WILS79]). Indeed they show that $\Delta G$ is the limit of a DOG, and that the DOG
closely approximates it. The two-dimensional cross section of the $\Delta G$ operator is shown in figure 20a. It can
be thought of as a smoothed version of a bar mask cross section, and may explain Marr's heuristic preference
for bar masks over edge masks mentioned earlier. Wilson and Bergen's work suggests that there should be
four bandpass channels at each retinal eccentricity, and that their characteristic sizes should scale linearly with
eccentricity, being smallest in the fovea and doubling in size by about 4°.
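
The scheme of Gaussian smoothing followed by a Laplacian can be illustrated in one dimension, where the
Laplacian of a Gaussian reduces to the second derivative of a Gaussian. The following is a minimal sketch,
not the MIT implementation: the function names, the step-edge test signal, and the values of sigma and radius
are all assumptions chosen for illustration.

```python
import math

def second_derivative_of_gaussian(sigma, radius):
    """Sampled second derivative of a 1-D Gaussian (1-D analogue of the Laplacian of a Gaussian)."""
    kernel = [(x * x / sigma**4 - 1.0 / sigma**2) * math.exp(-x * x / (2.0 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    # Subtract the small residual mean caused by truncation so that
    # constant regions of the signal produce a zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]

def filter_valid(signal, kernel):
    """Apply a symmetric kernel wherever it fits entirely inside the signal."""
    m = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(m))
            for i in range(len(signal) - m + 1)]

def zero_crossings(values, offset=0, eps=1e-9):
    """Signal indices i such that the response changes sign between i and i + 1."""
    found = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if (a > eps and b < -eps) or (a < -eps and b > eps):
            found.append(i + offset)
    return found

# Assumed test signal: a step edge between samples 39 and 40.
signal = [0.0] * 40 + [1.0] * 40
sigma, radius = 3.0, 12
response = filter_valid(signal, second_derivative_of_gaussian(sigma, radius))
crossings = zero_crossings(response, offset=radius)
print(crossings)
```

The single sign change of the filtered signal falls between the two samples that straddle the step, which is
the sense in which zero crossings localize an intensity change.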

Shanmugam, Dickey, and Green investigated the characteristics of the optimal frequency domain filter
for edge detection [SHAN79]. By "optimal" they mean the filter that produces the maximum energy in the
vicinity of the location of a (step) edge. Jernigan and Wardell [JERN81] have shown that there is no significant
difference between the optimizing filter derived by Shanmugam, Dickey, and Green, and the difference of
Gaussian filter proposed by Wilson and Bergen. The characteristics of the Shanmugam, Dickey, and Green
filter are largely determined by a constant c that is the product of the frequency domain bandwidth of the
optimal filter and its spatial interval. As c increases, the signal to noise ratio increases. However, for fixed
bandwidth, the improved signal to noise ratio is achieved at the expense of resolution.

Recently, Marr, Hildreth, and Poggio have noted evidence for a fifth, smaller channel in the fovea
[MARR79a]. Brady [BRAD80a] has shown how the Marr-Hildreth theory can be used to explain a number of
psychophysical results about parafoveal processing in reading.

Figure 21 shows images of a leaf and a coffee jar which has been sprayed with black paint to provide
a textured surface for stereoscopic fusion (see below). Figures 22 and 23 show the images in figure 21
filtered respectively through the coarsest and finest resolution channels in the fovea. Figure 24 shows the zero
crossings of the Laplacian applied to the filtered images shown in figures 22 and 23.

One of the novel aspects of the implementation of the theory concerns the sizes of the $\Delta G$ operators.
Edge finding operators are typically at most 7 pixels square; the smallest operator used in the implementation
of the Marr-Hildreth theory at MIT is 35 pixels square. Not only are the resulting operators much closer
approximations to the Gaussian (or any other filter for that matter), but the signal to noise characteristics of
the smoothed images are vastly improved. One practical consequence of this seems to be that for computing
the orientation of visible edges one can approximate differential operators by simple difference operators.
Conventional edge finding operators confound filtering and differentiation, and have poor and essentially
unpredictable filter characteristics. The first implemented version of the Marr-Hildreth theory took on the order
of three hours to compute the zero crossings in the coarse channel of an image 512 pixels square. A prototype
hardware implementation reduced this to 30 minutes. Nishihara and Larson report a TTL implementation
that computes and displays the zero crossings in any channel of an image 128 pixels square in under 0.25
seconds [NISH81].

Directional selectivity for motion

Marr and Ullman [MARR81] investigate the possibility that the time rate of change of

$$S(x, y, t) = (\Delta G) * I(x, y, t)$$

can enable one to detect the direction of motion of zero-crossings. Define

$$T(x, y, t) = \frac{\partial S(x, y, t)}{\partial t},$$

so that

$$T(x, y, t) = \Delta G * \frac{\partial I(x, y, t)}{\partial t}.$$

Figure 20. (a) Two-dimensional cross section of the $\Delta G$ operator.

Figure 21. Images of (a) a leaf and (b) a coffee jar sprayed to provide a textured surface.
(Reproduced from (a) [HILD80] and (b) [GRIM80].)

Figure 22. The result of bandpass filtering the images shown in figure 21 to simulate the information
available through the coarsest channel in the human fovea.

Figure 23. The result of bandpass filtering the images shown in figure 21 to simulate the information
available through the finest channel in the human fovea.

Figure 25 is based on [MARR81, figure 3]. It shows the response of $S(x, y, t)$ and $T(x, y, t)$ in the
vicinity of an isolated intensity edge. Notice that for motion to the right, $T(x, y, t)$ is positive at the zero
crossing, while for motion to the left it is negative. Marr and Ullman propose that motion to the right can
be detected by the simultaneous activity of $S^+$, $T^+$, and $S^-$. On the basis of this analysis they find close
agreement at moderate speeds between theoretical predictions and cell recordings (see figure 15). Richter
and Ullman [RICH80] have accounted for the discrepancy at high speeds, and generally refined the model
of directional selectivity, by noting that the two Gaussians whose difference approximates $\Delta G$ act like RC
filters, composed of a resistor and a capacitor, with different time constants. This causes a slight delay in the
onset of the negative outer part relative to the positive central part. Richter and Ullman's predictions show
remarkable agreement with cell recordings for a wide variety of stimuli (see figure 26). Coincidentally, Richter
and Ullman have proposed a theoretical structure for the outer plexiform layer of the human retina in which
$\Delta G$ is computed. This suggests a particular VLSI implementation of $\Delta G$. The general scheme is illustrated in
figure 27.
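
The sign test on the time derivative at a zero crossing can be tried on a synthetic moving edge. The sketch
below is a one-dimensional illustration under stated assumptions, not Marr and Ullman's implementation: the
edge is a step that is dark on the left and bright on the right (the opposite contrast would reverse the sign),
the time derivative is approximated by a difference of two frames, and all function names and parameter
values are hypothetical.

```python
import math

def second_derivative_of_gaussian(sigma, radius):
    """Sampled second derivative of a 1-D Gaussian with its residual mean removed."""
    kernel = [(x * x / sigma**4 - 1.0 / sigma**2) * math.exp(-x * x / (2.0 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]

def filter_same(signal, kernel):
    """Apply a symmetric kernel, replicating the end samples as padding."""
    radius = len(kernel) // 2
    padded = [signal[0]] * radius + list(signal) + [signal[-1]] * radius
    return [sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(signal))]

def step_edge(position, length=80):
    """Assumed stimulus: dark (0) to the left of `position`, bright (1) from it onwards."""
    return [0.0 if x < position else 1.0 for x in range(length)]

def direction_of_motion(edge_before, edge_after, sigma=3.0, radius=12):
    kernel = second_derivative_of_gaussian(sigma, radius)
    s_before = filter_same(step_edge(edge_before), kernel)   # S at time t
    s_after = filter_same(step_edge(edge_after), kernel)     # S at time t + dt
    # Locate the zero crossing of S at time t (positive on the left, negative on the right).
    z = next(i for i in range(len(s_before) - 1)
             if s_before[i] > 1e-9 and s_before[i + 1] < -1e-9)
    # T is approximated by the frame difference, sampled on both sides of the crossing.
    t_value = (s_after[z] - s_before[z]) + (s_after[z + 1] - s_before[z + 1])
    return "right" if t_value > 0 else "left"

print(direction_of_motion(40, 41), direction_of_motion(40, 39))
```

For this contrast polarity the frame difference is positive at the crossing when the edge moves right and
negative when it moves left, in line with the description above.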


3.1.2  Grouping  feature  points. 

The methods of the previous section produce a set of feature points (figure 28) corresponding to places in
the image at which the intensity change is considered significant. The next stage of processing imposes
structure on the sea of individuated feature points by grouping them to form extended contours. Marr [MARR76,
page 501] argues that "grouping processes are available precisely because they are needed to help interpret
the primal sketch; and furthermore that these symbolic processes, together with first order discriminations,
operating recursively on the description of the primal sketch, are sufficient to account for most of the range of
'non-attentive' vision of which we are capable."

Figure 25. Derivation of the STS operator proposed by Marr and Ullman for computing directional
selectivity of motion. (a) The response of a vertical contrast boundary at time $t$ to a $\Delta G$ operator,
showing the position $Z$ of the zero crossing. (b) At time $(t + \delta t)$ the edge has moved slightly
to the right. Subtracting yields an approximation to $T(x, y, t)$. Notice that $T$ is positive at $Z$. (c)
Analogously, an edge moving to the left is detected by a negative value for $T$ at $Z$. (Reproduced
from [MARR81, figure 3])

Figure 27. Spatial formation of a midget bipolar receptive field in the Richter-Ullman model. (a)
The arrangement of cones and horizontal cells. Each horizontal cell covers a circle (the shaded
area) with a radius three times larger than the cone pedicle (the dots). It contacts 7 cones, thus
seven horizontal cells contact each cone, connecting a total of 19 cones to create the surround
area of a midget bipolar cell. (b) The contribution to the surround of the first, second, and third
ring of cones. The receptive field of a midget bipolar cell resulting from the center contribution
of one cone and the above surround is shown in 5 and a slice through its center is shown in 6.
(Figure reproduced from [RICH80, figure 3])

We  may  assume  that  there  are  few  accidental  alignments  of  object  boundaries,  shadows,  reflectance 
boundaries,  and  surface  discontinuities  (also  called  "true  edges")  in  the  scene,  that  is,  the  image  is  taken 
from  "general  position".  Then  nearby  feature  points  mostly  arise  from  nearby  scene  points  and  for  the  same 
underlying physical cause. It follows that the descriptions associated with adjacent feature points that are
perceptually grouped are very similar. If feature points have reliable and rich descriptions, perceptual grouping
can  be  more  effective.  Similar  considerations  apply  to  other  cases  of  local  matching  in  vision  such  as  stereo, 
motion  computation,  and  the  determination  of  texture. 

Each of the methods for finding feature points described in the previous section has associated grouping
processes. For example, the Binford-Horn line finder compares feature points locally on the basis of the size
of the contrast step across the intensity change, the type of intensity change, and the slope of the gradient
[HORN73, page 7]. Marr [MARR76, page 503] also groups feature points on the basis of "orientation,
contrast, type (EDGE, LINE, etc.), and fuzziness". He notes that "the first stage of grouping combines two
elements only if they match in almost all respects, are very close to one another, and if there are no other
candidates." Typical results of this process are shown in figures 29 and 30. Marr proposes a number of
operations that group the short line segments produced by the first stage on the basis of collinearity, proximity, and
similarity of slope [MARR76a]. The results of these operations are histogrammed locally and the dominant
structures made explicit. Figure 29b shows the herring bone stripes computed from figure 29.

Many images contain extended straight contours, mostly corresponding to the straight edges that prevail
in our man-made environment. Duda and Hart [DUDA73] and O'Gorman and Clowes [OGOR73] popularized
a method introduced by Hough for finding straight lines in images. Ballard [BALL79] has extended the
method considerably, and we follow his development here. Suppose that one is interested in discovering
instances of circles in an image. Ballard proposes to find the circles from the feature points that form their
contours. Let there be a feature point at point $(x, y)$, and suppose that the gradient of the intensity change is in
direction $\theta$. A circle is uniquely specified by three parameters: its center $(a, b)$ and its radius $r$. To pass through
the feature point $(x, y)$, such a circle has to satisfy the constraint

$$(x - a)^2 + (y - b)^2 = r^2.$$

The gradient slope imposes the additional constraint $r = (y - b) \sec\theta$. It follows that each feature
point constrains the circles passing through it with the given slope to a one parameter family. As before,
adjacent feature points normally come from the same circle. There are two simple techniques for combining
the additional constraint. First, one might intersect the one parameter families in the spirit of line labelling
(see section 2). The noise inherent in the measurement of the center and radius suggests that something akin
to a relaxation technique be used to find optimal circles. Several authors have suggested such an approach
[ZUCK77, DAVI81]. Line labelling essentially combines evidence by an AND operation. Alternatively an
OR operation can be used, corresponding to a summation or histogram. To accommodate noise, the range of
possible values for the center and radius are quantized for each parameter to produce an "accumulator array".
Each feature point contributes one vote to the $(a_i, b_j, r_k)$ buckets in its one parameter family. Local maxima in
the accumulator array are assumed to correspond to instances of circles.
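
The voting scheme can be sketched as follows. This is an illustrative sketch under assumptions, not Ballard's
code: the gradient direction is taken to be measured so that y - b = r cos(theta) and x - a = r sin(theta), the
accumulator is quantized to integer pixels, the feature points are synthetic and noise-free, and the names
hough_circles, features, and radii are hypothetical.

```python
import math
from collections import Counter

def hough_circles(features, radii):
    """Accumulate votes for circle parameters (a, b, r).

    Each feature is (x, y, theta). For every candidate radius r the feature
    votes for the single center consistent with its gradient direction, so it
    contributes to a one parameter family of buckets.
    """
    votes = Counter()
    for x, y, theta in features:
        for r in radii:
            a = round(x - r * math.sin(theta))   # quantize the center to whole pixels
            b = round(y - r * math.cos(theta))
            votes[(a, b, r)] += 1
    return votes

# Assumed test data: 36 feature points on a circle of radius 10 centered at (20, 15).
true_a, true_b, true_r = 20, 15, 10
features = []
for k in range(36):
    theta = 2.0 * math.pi * k / 36
    features.append((true_a + true_r * math.sin(theta),
                     true_b + true_r * math.cos(theta),
                     theta))

votes = hough_circles(features, radii=range(5, 16))
best, count = votes.most_common(1)[0]
print(best, count)
```

Votes for the wrong radius are scattered over several neighbouring centers, while votes for the correct
radius all fall in one bucket, so the maximum of the accumulator identifies the circle.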

Ballard has extended the Hough transform technique of combining constraints on defining parameter
values to non-analytic functions and has shown how to estimate the effects of noise [BALL81].

3.1.3 Interpreting feature point segments as scene events

In the discussion of the microworlds in section 2, we noted the key contribution of Clowes and Huffman
who stressed the need to make explicit the relationship between image fragments and scene events. The line
labelling schemes of Huffman, Clowes, Kanade, Sugihara, and Waltz, and the surface labelling schemes of
Mackworth, Huffman, and Draper all developed this fundamental idea. Generalizing from the blocks world,
one would like to extend line interpretation to feature point segments. Elongated segments correspond to
boundaries that mark important scene events: that is why feature points were isolated in the first place. The
first attempt to extend blocks world labelling schemes to real images seems to have been Bajcsy and Tavakoli's
model based interpretation of aerial photographs [BAJC76a].

Figure 29. Image of a piece of herring-bone cloth and typical stripes extracted from it on the
basis of slope of gradient at feature points. (Reproduced from [MARR76a, figure 19])

Figure 30. a. An image of a piece of tweed and the feature points found in it using the Marr-
Hildreth theory of edge detection. The figure illustrates grouping on the basis of orientation of
the gradient of feature points. b. Image of bricks and feature points grouped on the basis of
contrast. (Reproduced from [HILD80, figure 25])

Marr  noted  a  correlation  between  different  types  of  intensity  change  and  the  scene  events  that  often  gave 
rise  to  them.  Entries  in  the  primal  sketch  were  marked  with  their  interpretation  in  the  scene,  such  as  "edge", 
"shading  edge",  and  "extended  edge”  [MARR76,  page  490].  With  the  development  of  zero  crossings,  and 
the  de-emphasis  of  bar  and  edge  masks,  it  is  unfortunately  no  longer  obvious  how  to  compute  the  assertions 
that  Marr  had  previously  advocated  for  inclusion  in  the  primal  sketch  [HILD80,  page  75].  The  whole  issue  of 
constructing  the  primal  sketch  from  zero-crossings  is  far  from  being  resolved. 

Binford [BINF81] and Lowe and Binford [LOWE81] have recently made an initial pass at the problem
of interpreting feature point segments. Compared with the blocks world labelling schemes, the labellings
that Lowe and Binford propose are very general. A segment is interpreted as a space curve, and constraints
are formulated on coincidence and the situations in which a curve corresponds to a bounding contour or true
edge.

3.2. Determining surface shape from intensity values

Horn and his colleagues at MIT have studied the perception of shape from grey level shading. The input
to the "shape from shading" process is the image and the output is some appropriate representation of surface
shape. The exact form of the latter representation is not yet fixed, although [HORN82] offers some thoughts.
Since we can perceive surface shape locally, in scenes with little or no semantic content, a reasonable first
approximation is to represent the shape of a surface by its local surface normal. This requires two parameters,
say $p$ and $q$. The relationship between shape and the intensity $I$ at a point $(x, y)$ in an image takes the form

$$I(x, y) = R(p, q),$$

which Horn [HORN77] calls the image irradiance equation. Mathematically, the image irradiance equation is a
nonlinear first order partial differential equation. Horn [HORN77] notes that the function $R$ encodes the
position of the viewer, the distribution of light sources (assumed to be fixed), and the reflectance characteristics
of the surface material. Horn and Sjoberg [HORN79] derive the relationship between the function $R$ and the
bidirectional reflectivity functions used by photometrists, and they show how to calculate it in particular cases.
One important special case is Lambertian reflectance, where the intensity varies as the vector dot product of
the local surface normal and the direction of the light source.

One useful parameterization of the local surface normal uses the partial derivatives $p = \partial f / \partial x$ and
$q = \partial f / \partial y$, where the viewed surface is $z = f(x, y)$. This gives rise to the representation introduced in
Section 2 called gradient space. Two comments are in order. First, since slant and tilt (as defined by figure 7)
have natural perceptual meanings, one might argue that the polar form of gradient space is preferred by the
human visual system. Stevens [STEV80] develops this argument, and some further support for the position is
provided by [WITK81].

Second, there is a basic problem with gradient space, namely its inability to represent occluding
boundaries at which the surface turns away from the viewer. At occluding boundaries the slant angle is $\pi/2$, so
that its tangent (a in figure 7) is infinite (note that this objection does not apply to using the slant and tilt
angles directly, as [STEV80] notes). Ikeuchi and Horn [IKEU81] introduce a different parameterization $(f, g)$ of
surface orientation that they call stereographic space. Formally, $f$ and $g$ are related to $p$ and $q$ by

$$f = \frac{2p}{1 + \sqrt{1 + p^2 + q^2}}, \qquad g = \frac{2q}{1 + \sqrt{1 + p^2 + q^2}}.$$

Ikeuchi and Horn introduce the Gaussian sphere, and show that gradient space corresponds to projecting the
Gaussian sphere onto the plane from its center, whereas stereographic space is the result of projecting from the
north pole (when the viewing direction is from the south pole).
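
The difference between the two parameterizations near an occluding boundary can be checked numerically.
The sketch below assumes the standard stereographic relations f = 2p / (1 + sqrt(1 + p^2 + q^2)) and
g = 2q / (1 + sqrt(1 + p^2 + q^2)), with inverse p = 4f / (4 - f^2 - g^2) and q = 4g / (4 - f^2 - g^2); the
function names are hypothetical.

```python
import math

def gradient_to_stereographic(p, q):
    """Map gradient space (p, q) to stereographic coordinates (f, g)."""
    s = 1.0 + math.sqrt(1.0 + p * p + q * q)
    return 2.0 * p / s, 2.0 * q / s

def stereographic_to_gradient(f, g):
    """Inverse map; undefined on the circle f^2 + g^2 = 4 (occluding boundary)."""
    d = 4.0 - f * f - g * g
    return 4.0 * f / d, 4.0 * g / d

# Round trip for a moderately slanted surface patch.
f, g = gradient_to_stereographic(0.5, -1.2)
p_back, q_back = stereographic_to_gradient(f, g)

# A slant of 89.9 degrees: p is in the hundreds, but (f, g) stays inside radius 2.
steep_p = math.tan(math.radians(89.9))
steep_f, steep_g = gradient_to_stereographic(steep_p, 0.0)
print(steep_p, steep_f, steep_g)
```

As the slant approaches 90 degrees the gradient grows without bound, whereas (f, g) approaches the circle of
radius 2, so orientations on an occluding boundary have finite stereographic coordinates.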

Although it cannot represent occluding boundaries, the mathematical development associated with
gradient space is easier, and so it is used in most of this section. For a fixed distribution of light sources, and
fixed reflectance characteristics, the image irradiance equation associates a brightness value with each surface
orientation. Thus we can assign a brightness value to each point of gradient space. The representation is then
called the reflectance map [HORN77]. It is convenient to scale brightness values to the range [0, 1], and to make
iso-brightness contours explicit. Figure 31 shows the iso-brightness contours for a Lambertian reflector in the
case of a single light source near the viewer. Figure 32 shows the result of moving the light source away from
the viewer, while figure 33 shows the reflectance map for a gloss surface which approximates white paint.
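
A reflectance map for the Lambertian case can be written down directly. The sketch below assumes a single
distant point source whose direction is given by a gradient (ps, qs), an orthographic view along the z axis,
and brightness scaled to [0, 1]; the function name and the sample values are hypothetical.

```python
import math

def lambertian_reflectance(p, q, ps, qs):
    """Brightness of a Lambertian patch with gradient (p, q) lit from direction (ps, qs).

    The value is the cosine of the angle between the surface normal, proportional
    to (-p, -q, 1), and the source direction, proportional to (-ps, -qs, 1).
    Patches facing away from the source are self-shadowed and clamped to 0.
    """
    cosine = (1.0 + p * ps + q * qs) / (
        math.sqrt(1.0 + p * p + q * q) * math.sqrt(1.0 + ps * ps + qs * qs))
    return max(0.0, cosine)

# Source at the viewer: brightness depends only on p^2 + q^2,
# so the iso-brightness contours are circles about the origin.
at_viewer_a = lambertian_reflectance(0.6, 0.8, 0.0, 0.0)
at_viewer_b = lambertian_reflectance(1.0, 0.0, 0.0, 0.0)

# Source moved away from the viewer: the maximum moves to (p, q) = (ps, qs).
peak = lambertian_reflectance(0.7, 0.3, 0.7, 0.3)
shadowed = lambertian_reflectance(-5.0, 0.0, 0.7, 0.3)
print(at_viewer_a, at_viewer_b, peak, shadowed)
```

With the source at the viewer the map reduces to 1 / sqrt(1 + p^2 + q^2); moving the source shifts the peak
and introduces a straight self-shadow line in gradient space.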

Having set up the representation of the output of shape from shading, we now consider some of the
algorithms that have been proposed for actually determining shape from an image. Recall that the image
irradiance equation is a (usually nonlinear) first order partial differential equation. As such, it can be
approached using one of the standard techniques for solving partial differential equations. Horn [HORN75]
applied the characteristic strip method of solving partial differential equations to reformulate the image
irradiance equation as a set of five ordinary differential equations. The solution surface is

$$f(x, y) - z = 0 \qquad (1)$$

and the image irradiance equation is

$$I(x, y) - R(p, q) = 0 \qquad (2)$$

Figure 31. Iso-brightness contours for a Lambertian reflector when the light source is near the
observer. The brightness at a point is determined by the cosine of the angle between the local
surface normal and the view vector. (Reproduced from [HORN77])

Figure 33. Iso-brightness contours for a reflector that approximates white gloss paint. Note the
peak relative to the Lambertian reflector shown in figure 31, corresponding to the mirror-like
component of reflection of gloss paint. (Reproduced from [HORN77, figure 7])


The surface normal has direction ratios $(p, q, -1)$. The characteristic strip method computes the solution
surface by finding a family of space curves (strips) whose local tangents all lie in the tangent plane of the
solution surface. Such a curve can be specified by a one parameter family of points $(x(s), y(s), z(s))$, where $s$
corresponds to the distance traversed along the curve. Differentiating equation (1) with respect to $s$, we find:

$$p \frac{dx}{ds} + q \frac{dy}{ds} - \frac{dz}{ds} = 0.$$

It follows that $\left(\frac{dx}{ds}, \frac{dy}{ds}, \frac{dz}{ds}\right)$ lies in the tangent plane of the solution surface. Since
$pR_p + qR_q - (pR_p + qR_q)$ is identically zero, $(R_p, R_q, pR_p + qR_q)$ also lies in the tangent plane. Equating
these two vectors gives the following three equations:

$$\frac{dx}{ds} = R_p, \qquad \frac{dy}{ds} = R_q, \qquad \frac{dz}{ds} = pR_p + qR_q.$$

Finally, differentiating equation (2) with respect to $x$ gives:

$$I_x = R_p p_x + R_q q_x.$$

Since $p_y = q_x$, we find

$$I_x = R_p p_x + R_q p_y = \frac{dp}{ds}.$$

Similarly,

$$I_y = R_p q_x + R_q q_y = \frac{dq}{ds}.$$


The characteristic strip formulation was used by Horn [HORN75] as the basis of an iterative computation
as follows. Suppose that we know that image point $(x_n, y_n)$ corresponds to a surface point at which the surface
gradient is $(p_n, q_n)$. Refer to figure 34, which shows iso-brightness contours passing through $(x_n, y_n)$ in the
image and $(p_n, q_n)$ in the reflectance map. Consider a step $ds$ along the characteristic strip, from $(x_n, y_n)$ to
$(x_{n+1}, y_{n+1})$ and, correspondingly, from $(p_n, q_n)$ to $(p_{n+1}, q_{n+1})$. The five ordinary differential equations
given above show that the step in the image is in the direction $(R_p, R_q)$, that is to say, along the normal to
the iso-brightness contour in the reflectance map. Similarly, the step in the reflectance map is in the direction
normal to the iso-brightness contour computed in the image. In this way, knowing the reflectance map, one
can proceed to compute a sequence of points and local gradients along the characteristic strip starting from a
point in the image at which the surface gradient is known. Figure 35 illustrates the results of applying Horn's
algorithm.
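
The stepping procedure can be sketched with a simple forward Euler integration of the five ordinary
differential equations. This is an illustration under assumptions rather than Horn's program: the image
I(x, y) = x^2 + y^2 and reflectance map R(p, q) = p^2 + q^2 are chosen because the surface
z = (x^2 + y^2) / 2 is a known solution against which the strip can be checked, and the function names, step
size, and starting point are hypothetical.

```python
def trace_characteristic_strip(x, y, z, p, q, image_gradient, reflectance_gradient,
                               step, n_steps):
    """Integrate one characteristic strip by forward Euler steps.

    dx/ds = R_p, dy/ds = R_q, dz/ds = p R_p + q R_q, dp/ds = I_x, dq/ds = I_y.
    """
    points = [(x, y, z, p, q)]
    for _ in range(n_steps):
        r_p, r_q = reflectance_gradient(p, q)   # normal to iso-brightness contour in the map
        i_x, i_y = image_gradient(x, y)         # normal to iso-brightness contour in the image
        x, y, z, p, q = (x + step * r_p,
                         y + step * r_q,
                         z + step * (p * r_p + q * r_q),
                         p + step * i_x,
                         q + step * i_y)
        points.append((x, y, z, p, q))
    return points

# Assumed test problem: I(x, y) = x^2 + y^2 and R(p, q) = p^2 + q^2.
def image_gradient(x, y):
    return 2.0 * x, 2.0 * y

def reflectance_gradient(p, q):
    return 2.0 * p, 2.0 * q

# Start at a point where the gradient is known: on z = (x^2 + y^2) / 2 we have p = x, q = y.
x0, y0 = 0.1, 0.05
strip = trace_characteristic_strip(x0, y0, 0.5 * (x0 * x0 + y0 * y0), x0, y0,
                                   image_gradient, reflectance_gradient,
                                   step=0.001, n_steps=1000)
x_end, y_end, z_end, p_end, q_end = strip[-1]
print(x_end, y_end, z_end)
```

The computed strip stays close to the known surface, and the small drift in z that accumulates along it is an
instance of the build up of numerical error that affects any individual strip.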

One problem with this method concerns the choice of the singular image point $(x_0, y_0)$ required to start
the iterative process at which the surface gradient $(p_0, q_0)$ is determined uniquely by the intensity data. A
further problem is that Horn's algorithm depends on the assumption that the underlying surface is locally
convex at the singular point. Finally, the class of image irradiance equations for which Horn's algorithm
works was unknown. (The latter question has recently been answered by [BRUS81].) Consequently research
was directed to discover the criteria under which the shape of a surface is uniquely determined by an image.
One suggestion was that bounding or occluding contours provided such conditions. Along such contours, the
surface normal can be computed exactly from the image. However, occluding contours pose a problem for
the gradient parameterization of local surface orientation, namely that at least one of the gradients $p$ or $q$ is
infinite. This led Ikeuchi and Horn [IKEU81] to propose stereographic projection as defined above.

Figure 34. The basis of Horn's iterative computation of shape from shading by the characteristic
strip method. The surface gradient at the image point $(x_n, y_n)$ is known to be $(p_n, q_n)$.
Iso-brightness contours are shown in the image and in the reflectance map. A short movement in the
image along the characteristic strip is in the direction of the solid line, which is normal to the
iso-brightness contour in the reflectance map. The converse relation also holds, and is depicted
by the dotted line.

Figure 35. A sample result of Horn's characteristic strip algorithm. The figure shows the picture of
a nose with superimposed characteristic strips (top figure) and contours (bottom figure). (Reproduced
from [HORN75])

Ikeuchi and Horn [IKEU81] note some additional problems with the characteristic strip method. First,
since the iterative method outlined above proceeds unidirectionally along a characteristic strip, it cannot
exploit boundary conditions at both ends of the strip. Second, the build up of numerical errors along any
individual strip can be substantial. A novel feature of Horn's [HORN75] algorithm is the simultaneous
development of several characteristics to control the build up of error in any one. Woodham [WOOD81] observes
that one can solve for surface shape if one makes a global assumption about the surface type, for example that
it is convex, a ruled surface, or the surface of a generalized cylinder (see Section 6). Other authors propose
smoothness constraints derived from the fact that the integral of depth around a closed loop in the image is
zero [BROO79, STRA79]. Ikeuchi and Horn [IKEU81] discuss a more direct formulation of a smoothness
condition that they state in terms of the stereographic parameterization of surface orientation. This enables
them to use the bounding contour of an object as a source of boundary values for an iterative computation
which fills in the surface orientation in the interior. Formally, denote the $n$th iterative approximation to the
value of $f_{i,j}$ at image point $(i, j)$ by $f_{i,j}^{n}$, with an analogous formula for $g_{i,j}$. Letting the local (four point)
average at the $n$th iteration be $\bar{f}_{i,j}^{n}$, Ikeuchi and Horn derive the following recurrence relation as the basis of
an iterative algorithm [IKEU81]:

$$f_{i,j}^{n+1} = \bar{f}_{i,j}^{n} + (I_{i,j} - R) R_f.$$



Here, $R_f$ is the partial derivative of the reflectivity function $R$ in the case of stereographic projection,
analogous to $R_p$, which was used above in the characteristic strip method. The resulting algorithm has been
tested on a variety of images and works well. In particular, it appears to degrade gracefully as errors are
introduced to the placement of the light source, the surface orientation on the boundary, and the nature of
the reflectivity assumed for the surface. Strong empirical evidence is provided that the algorithm converges,
although no proof is demonstrated. In case the occluding contour is partially incomplete, Ikeuchi and Horn's
algorithm still appears to converge, though it is not known at how many points it is necessary to specify the
stereographic parameterization of the surface normal.

Bruss [BRUS81] has recently studied some of the mathematical properties of the image irradiance equation.
First, she has shown that discontinuous solution surfaces can arise from a continuous image irradiance
equation. It follows that one cannot determine from a continuous image irradiance equation whether or not
there is an edge. The curvature of a surface also cannot be determined in general from its image. As an
example, the image irradiance equation $x^2 + y^2 = p^2 + q^2$ has two different solution surfaces, one of
which, $z = xy$, consists entirely of hyperbolic points, while the other, $z = \frac{1}{2}(x^2 + y^2)$, consists
entirely of elliptic points. However, Bruss has proved that there is only one solution that is convex. She has
also shown that bounding contours can be determined from the image only when the image irradiance equation
is singular. This means that the reflectance function $R$ and its first order partial derivatives are
continuous, while the intensity function $I$ is singular in x and/or y. For any given singular image irradiance
equation the points on the occluding contour can be found by inspection of the intensity function $I(x, y)$.
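
The two solution surfaces in the example can be checked by direct differentiation. The short derivation below assumes the usual gradient-space convention $p = \partial z/\partial x$, $q = \partial z/\partial y$, and uses the sign of the Hessian determinant $z_{xx} z_{yy} - z_{xy}^2$ (which has the same sign as the Gaussian curvature) to classify the points:

```latex
\begin{align*}
z = xy: \quad & p = y,\; q = x, \quad p^2 + q^2 = x^2 + y^2, \quad
  z_{xx} z_{yy} - z_{xy}^2 = 0 \cdot 0 - 1^2 = -1 < 0 \;\text{(hyperbolic)} \\
z = \tfrac{1}{2}(x^2 + y^2): \quad & p = x,\; q = y, \quad p^2 + q^2 = x^2 + y^2, \quad
  z_{xx} z_{yy} - z_{xy}^2 = 1 \cdot 1 - 0^2 = 1 > 0 \;\text{(elliptic)}
\end{align*}
```

Both surfaces produce the same image, yet they differ in the sign of their curvature everywhere.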

Bruss also studied singular "eikonal" image irradiance equations that are of the form $p^2 + q^2 = I(x, y)$.
If the intensity function $I(x, y)$ vanishes to second order at the singular point, that is to say has the form

$$I(x, y) = \alpha x^2 + \beta xy + \gamma y^2 + O(|x|^3 + |y|^3),$$

then there is exactly one positive locally convex solution surface in the neighborhood of the singular point.
This result is applied to show that if there is a closed bounding contour, the solution surface is unique (up to
translation along the z axis). If the image irradiance equation is not of the form $p^2 + q^2 = I(x, y)$, if the
intensity function does not vanish precisely to second order, or if there is not a smooth closed bounding
contour, there is not a unique solution surface. The reflectance function $p^2 + q^2$ closely models a number
of practical situations such as imaging with scanning electron microscopes.

Woodham, and Horn, Woodham, and Silver, have developed a rather different method for computing
shape from shading that may prove very important in practice, even if it bears very little resemblance to the
processes of human vision [WOOD81, HORN78b]. Suppose that we fix the view (camera) position, and that
we set up two light sources at different known points. Suppose that the intensity levels at any image point
(x, y) in the first and second images are $I_1(x, y)$ and $I_2(x, y)$. The first of these restricts the surface
orientation at (x, y) to the iso-brightness contour in the reflectance map corresponding to the brightness value
computed from $I_1(x, y)$ (figure 36a). Similarly, the surface normal is constrained by the iso-brightness
contour defined by $I_2(x, y)$ (figure 36b), and hence to their intersection (figure 36c). A third light source
provides complete disambiguation. This process has been called photometric stereo, and can be implemented
very efficiently as follows. First, there is a calibration phase in which an object whose surface shape is
known, such as a sphere, is illuminated in turn by the set of light sources and imaged. This generates a set of
n-tuples of intensity values (n is the number of light sources), each of which is associated with a known local
surface orientation on the known calibration object. The surface orientation distribution of an unknown
object can then be computed by using the n-tuples of intensity values at each corresponding image point as a
lookup key into a table. To keep the storage requirements of the algorithm within bounds, the intensity values
are quantized. One current implementation quantizes intensity to ten values in each of three measurements.
Intermediate intensity triples are handled by interpolation from the nearest entries in the table. The method,
which has been implemented by Silver, is fast and remarkably accurate [SILV80]. Figure 37 shows the
reconstruction of an egg after a calibration phase using a sphere. Figure 38 is the superposition of a cross
section of the known surface onto one computed by photometric stereo. Photometric stereo has been extended
to handle objects with specularities by Ikeuchi [IKEU81], and has recently been applied to the industrial
problem of bin-picking [BIRK81].
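
The calibrate-then-look-up procedure described above can be sketched in a few lines. This is a minimal illustration, not Silver's implementation: it assumes a Lambertian calibration sphere, three hypothetical light directions (`LIGHTS`), ten quantization levels per measurement as in the text, and it omits the interpolation step entirely. All function names are illustrative.

```python
import math

# Hypothetical unit vectors pointing toward each of three light sources.
LIGHTS = [
    (0.0, 0.0, 1.0),
    (0.5, 0.0, math.sqrt(0.75)),
    (0.0, 0.5, math.sqrt(0.75)),
]
LEVELS = 10  # quantization levels per intensity measurement


def lambertian(normal, light):
    """Brightness of a Lambertian patch (assumed reflectance): max(0, n . s)."""
    return max(0.0, sum(n * s for n, s in zip(normal, light)))


def quantize(intensity):
    """Map an intensity in [0, 1] to an integer bin 0 .. LEVELS - 1."""
    return min(LEVELS - 1, int(intensity * LEVELS))


def build_table(steps=200):
    """Calibration phase: sample surface normals over the visible hemisphere
    of a sphere and store, for each quantized intensity triple, the
    normalized mean of the normals that produced it."""
    sums = {}
    for i in range(steps + 1):
        for j in range(steps + 1):
            x = -1.0 + 2.0 * i / steps
            y = -1.0 + 2.0 * j / steps
            r2 = x * x + y * y
            if r2 >= 1.0:
                continue  # outside the image of the sphere
            normal = (x, y, math.sqrt(1.0 - r2))
            key = tuple(quantize(lambertian(normal, s)) for s in LIGHTS)
            acc = sums.setdefault(key, [0.0, 0.0, 0.0])
            for k in range(3):
                acc[k] += normal[k]
    table = {}
    for key, (sx, sy, sz) in sums.items():
        length = math.sqrt(sx * sx + sy * sy + sz * sz)
        table[key] = (sx / length, sy / length, sz / length)
    return table


def lookup(table, intensities):
    """Return the stored normal for an intensity triple, or None if the
    quantized triple was never seen during calibration."""
    return table.get(tuple(quantize(v) for v in intensities))


# Usage: recover the orientation of a patch on an unknown object.
table = build_table()
true_normal = (0.3, -0.2, math.sqrt(1.0 - 0.09 - 0.04))
estimate = lookup(table, [lambertian(true_normal, s) for s in LIGHTS])
```

With only ten levels per measurement each table entry covers a range of orientations, which is why an interpolation step such as the one mentioned in the text matters for accuracy.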

Figure 36. An illustration of photometric stereo. Suppose (a) that the brightness measured at the
point (x, y) in the first image is 0.6 and (b) in the second image the brightness at the same point
is 0.?. (c) Superposition of the first two constraints shows that there are at most two consistent
surface gradients.

Figure 37. The reconstruction of an egg shape by Silver's implementation of photometric stereo.
The reflectance was Lambertian. (Reproduced from [SILV80])

Figure 38. Comparison of the cross sections of an egg and a knob shape computed by photometric
stereo (solid lines) and the true cross sections extracted from photographs (dotted lines). (Reproduced
from [SILV80])

Optical flow

In Section 3.1.1, we surveyed the work of Marr and his group based on the detection of the important
intensity changes in an image. In particular, we mentioned the recent work of Marr, Ullman, and Richter
on detecting the direction of motion of a zero crossing by taking the time differential of
$\nabla^2 G * I(x, y, t)$. We conclude this section with a brief discussion of the work of Horn and Schunck
[HORN81c] that proposes a method for computing optical flow by differentiating the brightness distribution
in the image with respect to time. Optical flow is the distribution of velocities of apparent movement caused
by smoothly changing brightness patterns. It has been noted that optical flows encode rich information about
a scene and observer motion, and it has been suggested that this information can be computed from the flow
field. This position is particularly associated with the followers of J. J. Gibson, who first studied flow fields
[GIBS50, GIBS66, CLOC80, KOEN75, KOEN76, KOEN77, PRAZ80]. In particular, it has been suggested that
optical flow facilitates object segmentation [NAKA74, CLOC80], computation of the parameters of the
observer's own motion relative to the scene [PRAZ80, LONG80], and the determination of visible local
surface normals [PRAZ80].

The work on interpreting optical flow has generally assumed that the flow is given, that it is somehow
computed automatically and is sufficiently noise-free. "Velocity sensitive neurons" have been postulated to
compute the optical flow in animate visual systems [NAKA74]. Horn and Schunck [HORN81c] have studied the
generation of the optical flow from brightness patterns that vary smoothly with time. They restrict attention
to imaging a flat surface with uniform incident illumination, and smoothly varying reflectance. The image
brightness at point (x, y) at time t does not change, and so

$$\frac{dI(x, y, t)}{dt} = 0.$$

Expanding by the chain rule, we find

$$I_x u + I_y v + I_t = 0,$$

where $(u, v)$ is the optical flow $\left(\frac{dx}{dt}, \frac{dy}{dt}\right)$. This shows that the component
of the flow field in the direction of the brightness gradient $(I_x, I_y)$ is

$$-\frac{I_t}{\sqrt{I_x^2 + I_y^2}}.$$

It is not possible to determine the component of the flow field perpendicular to the intensity gradient,
that is to say along the iso-brightness contours. In practice, quantisation errors and noise imply that
$\frac{dI}{dt}$ is not exactly zero. To account for this, an error term $\mathcal{E}_b$ is introduced and
defined by:

$$\mathcal{E}_b = I_x u + I_y v + I_t.$$

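To compute the component of the flow field along iso-brightness contours requires an extra constraint.
Horn and Schunck derive a measure of the departure from smoothness of the flow [HORN81c]. Smoothness
can be estimated by the square of the magnitude of the gradient of the optical flow velocity:

$$\mathcal{E}_c^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2
 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2.$$

The estimate of the departure from smoothness and the change in brightness combine in a measure of the
error:

$$\mathcal{E}^2 = \iint \left(\alpha^2 \mathcal{E}_c^2 + \mathcal{E}_b^2\right) dx\, dy,$$

where $\mathcal{E}_b$ is the brightness error term defined above and $\alpha$ weights the smoothness term.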
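
A combined brightness and smoothness error of this kind is usually minimized iteratively. The sketch below uses the standard Horn-Schunck update, in which each flow estimate is replaced by its local average corrected along the brightness gradient; the update rule is stated here from the published algorithm rather than from the text above. The finite-difference derivative estimates, the four-neighbour average, the border handling, and the parameter names `alpha` and `iterations` are simplifying assumptions.

```python
import math


def horn_schunck(frame0, frame1, alpha=1.0, iterations=200):
    """Estimate optical flow (u, v) between two equal-sized grey-level frames,
    each given as a list of rows of floats. Returns two lists of rows."""
    rows, cols = len(frame0), len(frame0[0])

    def at(img, r, c):
        # Replicate border values so every pixel has four neighbours.
        return img[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    def central(img, r, c, dr, dc):
        # Central difference along the direction (dr, dc).
        return (at(img, r + dr, c + dc) - at(img, r - dr, c - dc)) / 2.0

    # Spatial derivatives averaged over the two frames; temporal difference.
    Ix = [[(central(frame0, r, c, 0, 1) + central(frame1, r, c, 0, 1)) / 2.0
           for c in range(cols)] for r in range(rows)]
    Iy = [[(central(frame0, r, c, 1, 0) + central(frame1, r, c, 1, 0)) / 2.0
           for c in range(cols)] for r in range(rows)]
    It = [[frame1[r][c] - frame0[r][c] for c in range(cols)]
          for r in range(rows)]

    u = [[0.0] * cols for _ in range(rows)]
    v = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        new_u = [[0.0] * cols for _ in range(rows)]
        new_v = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Local averages of the current flow estimate.
                u_bar = (at(u, r - 1, c) + at(u, r + 1, c)
                         + at(u, r, c - 1) + at(u, r, c + 1)) / 4.0
                v_bar = (at(v, r - 1, c) + at(v, r + 1, c)
                         + at(v, r, c - 1) + at(v, r, c + 1)) / 4.0
                ix, iy, it = Ix[r][c], Iy[r][c], It[r][c]
                # Brightness residual of the averaged flow, damped by alpha**2.
                resid = (ix * u_bar + iy * v_bar + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
                new_u[r][c] = u_bar - ix * resid
                new_v[r][c] = v_bar - iy * resid
        u, v = new_u, new_v
    return u, v


# Usage: a sinusoidal brightness pattern that moves one pixel to the right.
A, W = 10.0, 0.3
frame0 = [[A * math.sin(W * c) for c in range(32)] for _ in range(8)]
frame1 = [[A * math.sin(W * (c - 1)) for c in range(32)] for _ in range(8)]
u, v = horn_schunck(frame0, frame1)
```

Because the pattern varies only horizontally, the brightness constraint says nothing about vertical motion; the smoothness term leaves `v` at zero while `u` settles near one pixel per frame away from the image border.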


Figure 39. Optical flow patterns computed by the Horn-Schunck algorithm. (Reproduced from
[HORN81c, figure 10])

edge finding. Recall that edge finding has three stages. First, significant intensity changes are detected and
localized. The feature points are then grouped to form linear segments. Finally, segments are interpreted
as scene events, such as depth, reflectance, and shadow boundaries, as well as discontinuities in surface
orientation (true edges). Analogously, the process of segmentation begins by isolating those regions of an
image in which there are no significant changes of intensity, and adjacent regions are then grouped, or
"merged". Finally, the regions are interpreted as scene events, typically visible surfaces, shadowed areas, or
patches in which the reflectance is uniform. As in the case of edge finding, the difficult issue is to frame a
precise definition of "significant" so that segmented regions correspond to the perceptual entities that are
their interpretations.

Some authors [MARR78, page 64] have concluded that segmentation is an ill-defined operation, since
regions do not always correspond to portions of visible surfaces. Certainly, simple schemes for segmentation
produce many spurious regions, just as simple approaches to edge finding ascribe significance to spurious
intensity changes. Several authors have pointed out that region finding is no more, and no less, difficult than
edge finding [HARA79, BINF81]. If segmentation and edge finding differ at all, it is with respect to the
descriptions naturally associated with two-dimensional regions and one-dimensional segments.

Early work on segmentation implicitly modelled an image as a collage of regions that are homogeneous
in intensity and separated by step changes. A slight refinement was to accommodate noise heuristically by
merging across the weakest contrast boundaries [BRIC70, BARR71].

One approach to improving segmentation schemes is to incorporate better models of edge finding. Each
of the processes for discovering feature points outlined in section 3.1.1 can be adapted to segmentation.
Haralick [HARA80, page 62] observes that two pixels are part of the same region if and only if there is no
significant difference between their associated sloped facets. If every intensity change uncovered by the
Marr-Hildreth theory of edge finding is significant, then closed contours of zero crossings correspond to regions.

An alternative approach to improving segmentation is to invoke domain specific semantic information
either to encourage or inhibit the merging of regions [TENE77, Sid 1-81]. Such schemes for segmentation are
analogous to the semantically guided edge finders advocated by [BAJC75, BAJC76b, SHIR73].

Horn's work on shape from shading discussed in the previous section implies that there can be significant
variations in intensity within a perceptual surface. In general, only a planar surface produces a region that is
uniform in intensity (ignoring noise). Segmentation on the basis of intensity values is a heuristic consequence
of the early preoccupation with scenes composed of planar surfaces (see section 2). According to the image
irradiance equation, intensity is uniform within the image of a planar region because the surface orientation is
constant. Ballard [BALL80] suggests that the concept of segmentation is more naturally associated with
representations based on surfaces: Marr's 2½D sketch, Horn's needle map, and Barrow and Tenenbaum's
intrinsic images. As before, segmentation is the dual of discovering significant changes, say of surface
orientation or depth. Such processes await investigation. Ballard proposes that the Hough transform can be
generalized for this purpose [BALL80].

Many surfaces have constant texture or color. Color may be perceptually uniform across a surface
even if there is significant variation in intensity. Horn's work [HORN74], based on Land's retinex theory,
embodied the idea of segmentation on the basis of "lightness" for a two-dimensional world of "Mondrians".
Extending Horn's work to three dimensions would not be trivial. Tomita, Yachida, and Tsuji [TOMI71] also
experimented with segmentation on the basis of color. Ohlander, Price, and Reddy [OHLA78] experimented
with multi-spectral descriptions including hue, saturation, and brightness. Brady and Wielinga [BRAD78] note
that the Ohlander program works well on "patchwork quilt" images that are composed of large regions that
are uniform in one of its nine descriptors. Tenenbaum and Barrow [TENE77] observe that because it is based
on this heuristic, the program is easily fooled, especially by regions of repeated texture.

3.4  Texture 

Texture is a compelling visual cue to the properties of a surface. We can recognize a region of an image
as grass or the foliage of a bush or tree, and often we can do so in a black-and-white image without the aid
of color. We easily distinguish velvet, woollen weaves, herringbone, and raffia. Pebbled paths stand out
from the surrounding soil. It seems that most terrain classification from satellite images is based on texture
discrimination and recognition.

Haralick [HARA79] points out that although hundreds of articles have been written on the subject of
computer recognition and description of texture (mostly from the standpoint of pattern recognition), few
precise definitions of texture have been given. As a result, texture discrimination techniques are largely ad
hoc. Most accounts of texture are based on the idea that its distinguishing characteristic is regularity of the
"primitive" elements, called texels, of which the texture is composed, and of the spatial relationships between
texels. If there is wide variation in the size of individual blades of grass, or if the blades are sparsely and
non-uniformly distributed in the image, the grassy texture appears "ragged". In general, the strength of a
texture is determined by the regularity of its texels and regularity in the spatial relationships between the
texels. Zucker proposes that ideal textures are completely regular and can be modelled by regular
two-dimensional graphs [ZUCK76]. He suggests that naturally occurring textures are distortions of ideal textures.

We prefer a rather different view of texture, based on an idea of what purpose texture perception
serves. A grassy lawn, the foliage of a tree, and a pebbled path are all perceived as surfaces. Microscopic
variations in a surface determine its reflectance [HORN79], while large scale variations in a surface determine
its topography. The processes of determining shape from stereo, contour, texture, and motion are discussed
in section 4. Mostly they operate on isolated edges and regions found by one of the processes discussed in
sections 3.1 and 3.3. We suggest that texture refers to surface variations intermediate between microscopic
reflectance changes and topographical changes made explicit by edge finding and segmentation. It follows that
descriptions of texture require the isolation of macroscopic surface facets and the determination of the spatial
relationships between such facets. In order to be perceived as a single surface, surface facets (texels) that are
physically close should have similar descriptions. Regularity is the physical basis for grouping facets as a single
surface. Surface variations are labelled reflectance, texture, or topographic depending upon the resolution at
which they are viewed. (See [MALE77] for similar remarks.)

The twin themes of statistics and structure run through most of the literature on texture. We commented
above that regularity is central to texture. Inevitably, regularity has been modelled statistically; for example,
the distribution of slopes of individual blades of grass has a strong peak and small variance. Statistics has
been applied more or less uncritically to texture. Maleson, Brown and Feldman [MALE77] quip that "the
problem with statistical analysis is that if an inappropriate set of statistical measures is used, the final results
are meaningless. For this reason, it is important to base statistics on a reasonable model of the phenomena to
be measured." One approach to a 'reasonable model' is to apply statistical analysis only to texels that carry
significant information about surface structure, in particular, those isolated by edge finding and segmentation.

Haralick [HARA79] has presented a good survey of purely statistical approaches to texture. Simple ideas
such as computing autocorrelation functions perform relatively poorly [WESK76]. Bajcsy [BAJC73, BAJC76]
models regularity by periodicity as determined from features of the polar form $P(r, \phi)$ of the Fourier
transform of subimages. Combining all $r$ to show the dependence on $\phi$, peaks in $P_r(\phi)$ give
evidence of directional textures such as grass. If there are no peaks in $P_r(\phi)$, $P_\phi(r)$ is investigated
for peaks that give evidence of blob-like textures. Textures need to be strongly periodic to be found by the
method. A better model was introduced by Julesz [JULE62] and refined by several authors, including
Rosenfeld and Troy [ROSE70] and Haralick [HARA71]. The co-occurrence $P(i, j, d)$ specifies the relative
frequencies with which two grey levels $i$ and $j$ occur separated by a distance $d$. Haralick and Bosley
[HARA73] computed a number of features from co-occurrence matrices and used them to classify terrain from
satellite images, achieving success rates of over 80%. Julesz [JULE71] conjectured that textures can be
discriminated by non-attentive vision if and only if they differ in their second order statistics (essentially
their co-occurrence matrices). As originally formulated, co-occurrence matrices specify the relative
frequencies of individual grey levels. Horn's work on shape from shading shows how much information is
confounded in a single grey level. Only when surfaces are essentially planar, for example in satellite imagery,
is grey level a reliable basis for aggregation into regions corresponding to surfaces. Haralick [HARA79, page
787] notes that while co-occurrence based on grey levels captures spatial relationships, it does not capture
shape aspects and hence does not work well for textures composed of large-area texels. In short, individual
pixels are poor descriptors of surface facets.
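
A grey-level co-occurrence matrix is simple to compute. The sketch below is an illustration under stated assumptions: it takes the separation as a displacement vector `d = (drow, dcol)` rather than a scalar distance, counts each pair in both orders so that the result is symmetric, and stores the matrix sparsely as a dictionary. The `contrast` function is one commonly used co-occurrence feature, included only as an example of the kind of feature that can be derived from the matrix.

```python
from collections import Counter


def cooccurrence(image, d):
    """Relative frequency P[(i, j)] with which grey levels i and j occur at
    two pixels separated by the displacement d = (drow, dcol).
    `image` is a list of rows of integer grey levels."""
    drow, dcol = d
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + drow, c + dcol
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                # Count the pair in both orders so that P is symmetric.
                counts[(i, j)] += 1
                counts[(j, i)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # displacement larger than the image
    return {pair: n / total for pair, n in counts.items()}


def contrast(P):
    """Sum of (i - j)^2 * P(i, j): large when neighbouring levels differ."""
    return sum((i - j) ** 2 * p for (i, j), p in P.items())


# Usage: vertical stripes are high-contrast across and uniform along.
stripes = [[0, 1, 0, 1], [0, 1, 0, 1]]
across = cooccurrence(stripes, (0, 1))
along = cooccurrence(stripes, (1, 0))
```

Comparing `contrast(across)` with `contrast(along)` separates the two directions of the striped pattern, which is the sort of directional evidence that features computed from co-occurrence matrices provide.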
Co-occurrence is not restricted to grey levels, however. Maleson, Brown, and Feldman [MALE77]
propose segmented regions as texels. They suggest region descriptors that are insensitive to scale, such as
the orientation of the major axis and eccentricity of the best fitting ellipse to a region. Details of the
performance of a system based on this technique on a range of textures have yet to be published. Marr [MARR76]
suggests that texture discrimination based on co-occurrence matrices could be accounted for by discrimination
on ordinary statistics applied to the primal sketch. The scheme was not implemented, nor were descriptions
proposed for texture. To this end, the main advance has been due to Vilnrotter, Nevatia, and Price [VILN81].
Their work is based on the Nevatia and Babu edge finder (see section 3.1). Textures are detected from edge
repetition arrays that specify the co-occurrence of edges in a particular direction at a particular spacing. Once
detected, texels are described in terms of their average size and intensity. Spatial organization is found by
relating texels in different directions. Figures 40 and 41 show the results computed by the system for raffia and
brick textures.

Figure 40. a. Image of raffia. b. Sample of output from analysis of edge repetition arrays. c.
Abstract representation of the texels found in the raffia image. d. Reconstruction of the raffia
image using the abstract texels. (Reproduced from [VILN81, figures 1-4])

Figure 41. b. Illustration of the abstract primitives and spatial organization found in the textures
in a. (Reproduced from [VILN81])

4.  Determining  shape  from  the  primal  sketch

4.1.  Shape  from  stereo 

The slight disparities in the images received by the left and right eyes enable humans to determine the
shape and relative depth of visible surfaces. The importance of automating stereo, and the difficulty of the
problem, is well stated in a recent overview of Defense Mapping Agency applications [MAHO81].

There have been several attempts to develop a computational theory of binocular stereopsis since
Julesz's demonstrations in the early 1960's that it is possible to fuse images stereoscopically without extensive
monocular processing. Julesz [JULE71] presented substantial experimental evidence regarding binocular
fusion of random dot stereograms, a perceptual device that he originated (see figure 42). The essence of stereo
vision is the matching of descriptions computed from the images presented to the left and right eyes. The
Julesz demonstrations argue that the descriptions to be matched are available at an early stage of visual
processing. Two candidate descriptions considered for matching to date are the image (area correlation), and a
representation of intensity changes (edge based stereo).

Julesz conjectured that stereo is a local parallel process, and a number of algorithms have been designed
with this conjecture in mind. The first of these is due to Dev [DEV75], closely followed by Marr and Poggio
[MARR76b, MARR76c]. Marr and Poggio call their algorithm "cooperative" by analogy with boundary value
computations in physics. The algorithm could equally well be called a relaxation process [DAVI81]. Marr
[MARR78] notes a number of difficulties with such algorithms as a theory of human stereo vision, namely
human tolerance for the defocussing of one image, and the apparent ubiquity of vergence movements of the
eyes as two images are fused. Perhaps more important are the so-called hysteresis effects in which images
are matched only after a delay, or remain fused when they are pulled apart by an amount greater than is
apparently possible for matching. Marr and Poggio [MARR79b] argue that while hysteresis effects suggest
cooperativity, the effect can also be achieved by postulating a dynamic memory in which intermediate results
of stereo processing can be stored.

Figure 42. A random dot stereogram devised by Julesz [JULE71]. First, an image is produced for the
left eye, composed of random dots. The view from the right eye is determined by translating
each dot in the random dot image leftwards by an amount that depends on the relative distance
of the corresponding point in a conceptual scene. Some dots are occluded as a result. Other image
points that could not be seen by the left eye are now visible in the right eye. Such points are
randomly filled by new dots.


Most work on area correlation stereo [HANN74, QUAM71, HEND78] operates on a succession of small
windows (typically 10 by 10) from one image. For each window in the left image, a search is conducted
for that window in the right image that optimizes a suitable correlation relation between the grey levels in
the two windows. Area correlation has proven to be particularly effective in textured or smoothly shaded
areas. It has supported terrain following automatic guidance systems, and some automatic mapping systems
where the goal is to generate a digital terrain model associating a height with each map point imaged.
Area correlation implicitly assumes that the left and right images differ only in viewpoint, that is, that the
grey levels of corresponding areas are photometrically invariant between the two views. As a result, area
correlation performs poorly near surface discontinuities where this photometric assumption is false.
Conversely, edge based stereo assumes that the invariance between the left and right images is geometric.
Baker and Binford [BAKE81] observe that in general the geometric assumption implicit in edge based stereo
is more realistic than the photometric assumption implicit in area correlation. A further shortcoming of
current area correlation techniques is that their accuracy is limited to a fraction of the window size
(typically 5 picture elements). Edges can normally be localized with subpixel accuracy [MACV81, MARR79a].

Figure 43. The zero crossings located in the four channels of the Marr-Hildreth theory for the
random dot image shown in a. (Reproduced from Grimson's forthcoming book [GRIM81])
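
The window search used in area correlation can be sketched as follows. This is a minimal illustration, not any of the cited systems: it uses the sum of squared grey-level differences as a stand-in for "a suitable correlation relation", searches only along the same image row, and assumes that the right image is displaced leftwards relative to the left image. The names `window_ssd`, `match_disparity`, `half`, and `max_disparity` are illustrative.

```python
def window_ssd(left, right, r, c_left, c_right, half):
    """Sum of squared grey-level differences between the window centred at
    (r, c_left) in the left image and at (r, c_right) in the right image."""
    total = 0.0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            diff = left[r + dr][c_left + dc] - right[r + dr][c_right + dc]
            total += diff * diff
    return total


def match_disparity(left, right, r, c, half=2, max_disparity=8):
    """Return the disparity d >= 0 for which the window centred at (r, c) in
    the left image best matches the window centred at (r, c - d) in the
    right image. The caller must keep (r, c) at least `half` pixels away
    from the image border."""
    best_d, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        c_right = c - d
        if c_right - half < 0:
            break  # candidate window would leave the right image
        cost = window_ssd(left, right, r, c, c_right, half)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

The same loop makes the limitations noted in the text concrete: the match is only as good as the assumption that grey levels agree between the two windows, and a window that straddles a surface discontinuity contains pixels with different disparities, so no single shift matches it well.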

Implicit in the above remarks about the suitability of area correlation for stereo matching of textured
areas is a model of texture based on grey levels. We found earlier (Section 3.4) that texture describes surface
microstructure with texels corresponding to surface facets. The extension of the approaches to edge based
stereo to densely textured areas awaits further work on edge and region based accounts of texture.

Edge based stereo is strong where area correlation is weak, and conversely. An additional advantage of
edge based stereo is its potentially greater efficiency, as there are considerably fewer edges than grey levels.

Stereo rests upon, and provides a stiff test for, any account of edge finding. In section 3.1.1 we discussed a
number of approaches to edge finding. Marr and Hildreth's approach to detecting feature points has been
applied to stereo by Marr and Poggio [MARR79b]. The left and right images are convolved with $\nabla^2 G$
operators as described in 3.1.1. Matching takes place between the paired sets of zero crossings. Figure 21
showed the image of a coffee jar sprayed with spots of paint to yield a Julesz-like random dot stereogram from
a real scene, and figure 24 showed the zero crossings produced by each of the four channels proposed by the
Marr-Hildreth theory. Figure 43 shows the zero crossings produced in each of the four channels for the random
dot image shown in figure 43a. In both figures 24 and 43, it is evident that it is considerably more difficult
to establish an optimal match between the output of the fine channel from the left and right images than
between the outputs of the coarse channel. Exploiting this observation, matching proceeds from the coarsest
channel, which makes explicit gross detail and establishes a rough correspondence, down to the finest
resolution channel. This coarse-to-fine strategy, in which a rough plan is used to narrow the search space prior
to more detailed processing, is a basic idea in artificial intelligence. A coarse-to-fine strategy like that in the
Marr-Poggio theory of stereo seems to have been used by Moravec [MORA80] in a system constructed at
Stanford. Note that the coarse-to-fine strategy may have to be modified for closely spaced edges that occur
with textured surfaces.

Once the match between the zero crossings in the two images has been established for the four channels,
one can compute the angular disparities (or even distances) to matched zero crossings; [GRIM81] gives details.
Figures 44 and 45 show the disparity values computed for the coffee jar and the random dot stereogram shown
in figure 42. A disparity value is recorded only where zero crossings from the two eyes are matched, and
so the disparity map is often discrete. Since we mostly perceive the world as composed of smooth surfaces,
it is necessary to consider possible interpolation processes for smoothly completing the surface orientation
map from the discrete set of disparity values. This is a general problem and is discussed in the next section.
Grimson's reconstruction process computes the shape shown in figure 46. Grimson's implementation of the
Marr-Poggio stereo theory demonstrates all of Julesz's experimental findings. It has also been applied to a
small number of stereo pairs of natural images.

Figure 44. The disparity map computed from the output of the stereo matcher for the coffee jar.
(Reproduced from Grimson's forthcoming book [GRIM81])

In section 3.1 we characterized edge finding as having three successive stages: determining feature points,
grouping them on the basis of their attributes, and interpreting them as scene events. The Marr-Poggio theory
matches feature point descriptions on the basis of the position and sign of the zero crossing, before the feature
points are grouped into linear segments. Recent psychophysical findings of Mayhew and Frisby [MAYH81]
seem to indicate that it is necessary to match richer descriptions than zero crossings. Baker and Binford
[BAKE81] and Arnold [ARNO78] propose that ambiguities can be resolved more efficiently and successfully
on the basis of the richer descriptions associated with points on linear segments. Baker and Binford [BAKE81]
match points at various scales using the position, contrast, and slope of the segment in the image, and the
intensities on both sides of the intensity change. These separate pieces of evidence are combined by a linear
weighting function. The optimal match is found along horizontal scan lines using a fast linear programming
technique. Once edges are matched, grey levels are correlated by a similar process. Figure 47 shows the results
computed by Baker and Binford's program on an image with both texture and edges. Arnold [ARNO78] also
filters putative matches according to the position, slope, and contrast of edge segments. The edge segments
are found using Hueckel's surface fitting technique. Arnold claims that this is the program's main deficiency.
It is interesting to speculate how the Baker and Binford or Arnold algorithms might perform if they had the
Marr-Hildreth zero crossing data to work on. Alternatively, it is interesting to ask how the richer descriptions
proposed by Baker and Binford, Arnold, and Mayhew and Frisby could be incorporated into the Marr-Poggio
theory.

Figure 45. The disparity map computed from the output of the stereo matcher for the random
dot stereogram shown in figure 42. (Reproduced from Grimson's forthcoming book [GRIM81].)

All of the programs discussed in this section, except Arnold's, assume that the left and right images have
been rectified prior to stereo matching. That is, they assume that the images have been rotated, translated,
and scaled so that corresponding feature points can be found on the same horizontal scan line. Arnold's
program relies upon a rectification procedure developed by Moravec and Gennery [MORA79, GENN79]. In
this procedure, "interesting" points such as corners are found in both images, and an optimal match is found.
The tentative match is refined using a high resolution area correlator. A camera model solver computes the
direction of the stereo axis, the relative rotation, scale change, and lateral translation between the left and right
views. The ground plane is also determined. Lucas and Kanade have recently explored the application of a
Newton-Raphson-like technique to solve for the camera parameters [LUCA81]. Rectification remains a difficult
open problem.

Figure 46. The reconstructed coffee jar interpolated by Grimson's program from the disparity
map of figure 44. (Reproduced from Grimson's forthcoming book [GRIM81].)

Figure 47. Example results of Baker and Binford's stereo program. a. Stereo pair of images of
natural terrain. b. The edges found in the images by a simple differencing operation. c. Illustration
of disparities computed for the images. (Reproduced from [BAKE81, figures 10, 11, and 17].)

4.2  Shape from contour

Witkin [WITK81] has made a start on what seems to be a promising approach to computing shape from
a primal sketch. His work concerns the perceived slant and tilt of a line drawing lying in a plane, such as the
map outline shown in figure 48. Witkin's approach relies on making the image forming process explicit, and
using it to derive a probability density function. Assume that the axes in the image and in the planar scene are
aligned, and denote the tangent direction measured in the image by $\alpha^*$ and the tangent at the corresponding
point in the scene by $\beta$. Image foreshortening gives the relation

$$\tan(\alpha^* - \tau) = \frac{\tan\beta}{\cos\sigma},$$

where $\tau$ is the tilt and $\sigma$ is the slant of the planar scene. A collection of measurements of $\alpha^*$ taken throughout
the image defines a distribution of tangent directions. If we hypothesize particular values for $\sigma$ and $\tau$, the above
relation establishes a distribution for $\beta$. Given an expected distribution for $(\beta, \sigma, \tau)$, the likelihood of any
observed distribution of $\alpha^*$ can be evaluated. Witkin shows that the probability density function of $(\beta, \sigma, \tau)$ is
$\pi^{-2}\sin\sigma$. It turns out that the relative likelihood of $(\sigma, \tau)$ given a set $A^*$ of $n$ measurements $\alpha_i^*$ of $\alpha^*$ is

$$\prod_{i=1}^{n} \frac{\pi^{-2}\sin\sigma\cos\sigma}{\cos^2(\alpha_i^* - \tau) + \sin^2(\alpha_i^* - \tau)\cos^2\sigma}.$$

The value of $(\sigma, \tau)$ for which this estimator assumes a maximum is the maximum likelihood estimate for
surface orientation. Figure 49 shows the results of this procedure applied to a variety of shapes, and compares
it to the tilt as estimated by humans. Witkin found that tilt could be estimated considerably more accurately
than slant, a result he and Stevens [STEV80] established independently. In further work, Witkin assumes that
surfaces are locally planar and applies a similar analysis to compute local surface orientation [WITK81].
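
The maximum likelihood estimate described above can be illustrated with a small grid search. The following
Python sketch is not Witkin's program: the function names, the 5 degree search grid, and the synthetic input
(scene tangents spaced evenly in direction, viewed at slant 60 degrees and tilt 30 degrees) are all assumptions
made for illustration. It generates image tangents from the foreshortening relation and maximizes the logarithm
of the relative likelihood, with the constant factor of pi dropped.

```python
import math

def project_tangent(beta, slant, tilt):
    """Image tangent direction for scene tangent beta (all angles in radians).

    Implements tan(alpha* - tilt) = tan(beta) / cos(slant).
    """
    return tilt + math.atan2(math.sin(beta), math.cos(beta) * math.cos(slant))

def log_relative_likelihood(alphas, slant, tilt):
    """Log of the product over measurements, omitting the constant pi**-2."""
    total = 0.0
    for a in alphas:
        d = a - tilt
        denom = math.cos(d) ** 2 + (math.sin(d) ** 2) * math.cos(slant) ** 2
        total += math.log(math.sin(slant) * math.cos(slant) / denom)
    return total

def estimate_orientation(alphas, step_deg=5):
    """Grid search for the (slant, tilt) pair, in degrees, with the largest likelihood."""
    best = None
    for s in range(step_deg, 90, step_deg):       # slant strictly inside (0, 90)
        for t in range(0, 180, step_deg):         # tilt is only defined modulo 180
            ll = log_relative_likelihood(alphas, math.radians(s), math.radians(t))
            if best is None or ll > best[0]:
                best = (ll, s, t)
    return best[1], best[2]

# Hypothetical test input: evenly spread scene tangents seen at slant 60, tilt 30.
n = 180
betas = [(k + 0.5) * math.pi / n for k in range(n)]
alphas = [project_tangent(b, math.radians(60), math.radians(30)) for b in betas]
slant_hat, tilt_hat = estimate_orientation(alphas)
```

On this idealized input the tilt is recovered on the grid, while the slant is only expected to be approximate,
which is in line with the observation that tilt is estimated more accurately than slant.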


4.3  Shape  from  texture 

Of the modules which seem to bridge the gap between the primal sketch and the surface orientation map,
none has received quite as much attention from psychologists as the computation of surface orientation and
depth from texture gradients. Ever since Gibson [GIBS50] drew attention to their importance for computing
depth (figure 50), they have been a major concern of his followers. Stevens [STEV80] notes the simplifications
assumed by most published analyses of texture gradients in the psychological literature. Typically, a horizontal
ground plane is assumed that stretches into the far distance. Stevens proposes a two step computation: (1)
isolate "characteristic directions" in which there is no depth change, and (2) compute depth from the slant and
tilt representation of surface orientation. The idea has not been implemented. It assumes that primitive texels
can be computed for natural images with sufficiently precise descriptions that the characteristic directions
can be computed accurately. Bajcsy and Lieberman [BAJC76a] base the computation of texture gradients on
Bajcsy's application of the Fourier power spectrum to describing texture (see section 3.4) [BAJC73]. All of the
other methods for computing texture discussed in section 3.4 could be adapted to the determination of texture
gradients.

Figure 48. A geographic contour shown at various orientations, with the density function obtained
at that orientation. The density function is plotted by iso-density contours, with $(\sigma, \tau)$ represented
in polar form: $\sigma$ is given by distance to the origin, $\tau$ by the angle. The sharp symmetric peaks
clearly visible at higher slants are the maximum likelihood estimates for $(\sigma, \tau)$. (Reproduced from
[WITK81, figure 4].)

Figure 49. Results of running Witkin's estimation strategy. A number of shapes are shown at
left. The center column plots human estimation of the tilt of the shapes, and the right column
shows the tilt values predicted by the estimation strategy. (Reproduced from [WITK81, figure 5].)

Figure 50. A texture gradient in a natural scene. (Reproduced from [GIBS50].)

Kender [KEND80] has considered the computation of shape from texture as an instance of a general
methodology that yields "shape from" algorithms from a variety of image observables. The general plan of
Kender's approach has three parts:

•  Primitive texels are extracted from the image. Kender assumes that texels are the image of planar
surface facets, but he offers no guidance for computing them.

•  Each texel is assigned a set of possible scene parameters. This is the core of the approach. He introduces
a set of normalized texture property maps (NTPM) that generalize, for example, Horn's reflectance map
(section 3.2).

•  Texels that are assumed to arise from neighboring surface facets in three space compare the constraints
on their sets of possible parameters, casting out those that are inconsistent on some appropriate grounds of
smoothness. As Kender points out, this step is similar to relaxation processing as advocated by Davis and
Rosenfeld [DAVI81].

Ballard's parameter networks bear many similarities to Kender's scheme [BALL81]. Where Kender
prefers intersecting constraints, Ballard prefers adding them in accumulator arrays as part of his advocacy of
the generalized Hough transform.

Kender's NTPMs have four associated choices.

•  Since the goal of a "shape from" algorithm is a precise description of surface shape, an appropriate
parameterization of surface orientation needs to be chosen. Popular choices are gradient space (section 2,
section 3.2), the Gaussian sphere [HORN82], and stereographic space [IKEU81] (see section 3.2). In the
example presented below, we choose gradient space.

•  The imaging geometry is a key component of texture gradients. The essential choice is between
perspective and parallel (orthographic) projection. Kender shows that while the mathematics of perspective
projection is more complex, the constraint it offers is considerably tighter. For mathematical simplicity, we
choose parallel projection.

•  Assuming that texels have somehow been made available, several texture measures can be computed
and related to possible scene arrangements. Popular choices are texel length (for example the length of the major
axis of one of the barrels shown in figure 50), the slope in the image of some direction associated with the
texel (compare [MALE77]), the angle in the image between two directions associated with the texel (compare
Kanade's work on skew symmetry discussed in section 2 [KEND80]), or dot or edge density (compare
[ROSE70, ROSE71]). We consider length and slope in the example below.

•  Finally, the way in which the facet that projects to the texel is connected to the underlying surface has
to be assumed. In figure 51 the facets can be interpreted as lying in the plane or protruding from it.

Figure 51. A texture with an unusual relationship between facets and the underlying planar
surface. (Reproduced from [KEND80, figure 3.4].)
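
The first of these choices, the parameterization of surface orientation, can be made concrete with a few
conversions. The Python sketch below is illustrative only: the function names are hypothetical, the normal is
taken along (p, q, -1) as in the example that follows, and the stereographic coordinates are assumed to follow
the common convention in which the gradient is scaled by 2 / (1 + sqrt(1 + p^2 + q^2)).

```python
import math

def gradient_to_unit_normal(p, q):
    """Unit vector along the surface normal (p, q, -1)."""
    norm = math.sqrt(1.0 + p * p + q * q)
    return (p / norm, q / norm, -1.0 / norm)

def gradient_to_slant_tilt(p, q):
    """Slant is the angle between the normal and the viewing axis; tilt is the
    direction of the gradient in the image plane. Both are in radians."""
    slant = math.atan(math.hypot(p, q))
    tilt = math.atan2(q, p)
    return slant, tilt

def gradient_to_stereographic(p, q):
    """Stereographic coordinates (assumed convention, see the lead-in).

    Unlike gradient space, these stay bounded as the surface turns edge-on.
    """
    scale = 2.0 / (1.0 + math.sqrt(1.0 + p * p + q * q))
    return (scale * p, scale * q)
```

For example, the gradient (1, 0) corresponds to a slant of 45 degrees and a tilt of zero, and however steep
the gradient becomes its stereographic image stays within a disc of radius 2.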

As an example of Kender's approach, consider the abstract texture shown in figure 52. We shall make
the following choices: gradient space representation of surface orientation, parallel projection, and length
and image slope of texels. We shall assume that the texels all lie in a planar surface and form two mutually
orthogonal sets. We shall show that the orientation of the surface is completely determined.

Figure 52. An abstract texture. The horizontal texels and the texels slanted at 45° are assumed to have the
same length in the image and in the scene. It is further assumed that the horizontal texels are
orthogonal to the slanted texels in the scene. (Reproduced from [KEND80, figure 3.9].)

We first consider the NTPM corresponding to the length of a texel. Figure 53 shows a texel of length $L$
and slope $\alpha$ in the image. Suppose that one end of the texel is at the image origin and that the corresponding
scene point is $(0, 0, d)$. Suppose that the deprojection of the other end of the texel is $(L\cos\alpha, L\sin\alpha, e)$.
Since the deprojection of the texel lies in the plane whose normal is $(p, q, -1)$, it follows that
$e - d = pL\cos\alpha + qL\sin\alpha$. The length of the deprojected texel is therefore

$$L_s = L\,[1 + (p\cos\alpha + q\sin\alpha)^2]^{1/2}.$$

Applying this to the texture shown in figure 52, the horizontal texels ($\alpha = 0$) and the slanted texels
($\alpha = 45^\circ$) have equal deprojected lengths, that is

$$1 + p^2 = 1 + \tfrac{1}{2}(p + q)^2,$$

or,

$$p^2 - q^2 - 2pq = 0.$$

Figure 53. Length and slope of a texel in the image.

We now consider the NTPM corresponding to image slope $\alpha$ of the texel shown in figure 53. Consider
a scene-based coordinate system defined by the normal to the planar facet, the line of steepest descent of
the facet, and a direction chosen to make a right handed system. The gradient line has direction ratios
$l = (p, q, p^2 + q^2)$. The normal to the plane is $n = (p, q, -1)$, and so the third direction of the scene-based
coordinate system is the cross product of these two, namely $m = (q, -p, 0)$. Consider the deprojection
$\gamma = (\cos\alpha, \sin\alpha, d)$ of the texel shown in figure 53. Kender [KEND80] defines the slope $\beta$ of $\gamma$
relative to this scene-based coordinate system. If we assume that $\gamma$ lies in the plane, so that $n \cdot \gamma = 0$, we find

$$\tan\beta = \frac{q\cos\alpha - p\sin\alpha}{(p\cos\alpha + q\sin\alpha)(1 + p^2 + q^2)^{1/2}}.$$

Applying this to the texture shown in figure 52, the slope $\beta_h$ of the horizontal texels is given by

$$\tan\beta_h = \frac{q}{p\,(1 + p^2 + q^2)^{1/2}}.$$

Similarly, the slope $\beta_s$ of the slanted texels is given by

$$\tan\beta_s = \frac{q - p}{(q + p)(1 + p^2 + q^2)^{1/2}}.$$

If we assume that the texels all lie in the plane and that they form two orthogonal sets, we have

$$\tan\beta_h \cdot \tan\beta_s = -1.$$

Solving, we get another quadratic in $p$ and $q$, namely $p^2 + pq + 1 = 0$ for a nonzero gradient. When combined
with the length constraint we can solve up to Necker reversal. Kender points out that if perspective projection
is assumed the sense of the Necker reversal is often resolved.
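
The worked example can be checked numerically. The Python sketch below is an illustration rather than
Kender's procedure: the function names are hypothetical, and the orthogonality condition is imposed directly on
the deprojected texel directions (1, 0, p) and (1, 1, p + q), which gives 1 + p^2 + pq = 0. Together with the
length constraint p^2 - q^2 - 2pq = 0, the substitution q = t p leaves two gradients that differ only in sign,
which is the Necker reversal.

```python
import math

def deproject(alpha, p, q):
    """Scene direction of a unit image texel with slope alpha lying in a plane
    with gradient (p, q), under parallel projection."""
    c, s = math.cos(alpha), math.sin(alpha)
    return (c, s, p * c + q * s)

def deprojected_length(L, alpha, p, q):
    """L * sqrt(1 + (p cos(alpha) + q sin(alpha))^2)."""
    return L * math.sqrt(1.0 + (p * math.cos(alpha) + q * math.sin(alpha)) ** 2)

def solve_gradient():
    """Solve p^2 - q^2 - 2pq = 0 together with 1 + p^2 + pq = 0."""
    solutions = []
    # With q = t * p the length constraint becomes 1 - t^2 - 2t = 0.
    for t in (-1.0 + math.sqrt(2.0), -1.0 - math.sqrt(2.0)):
        # Orthogonality then reads p^2 * (1 + t) = -1, which needs 1 + t < 0.
        if 1.0 + t < 0.0:
            p = math.sqrt(-1.0 / (1.0 + t))
            solutions.extend([(p, t * p), (-p, -t * p)])
    return solutions

solutions = solve_gradient()
```

Each returned gradient gives equal deprojected lengths for the horizontal and 45 degree texels and a zero
dot product between their deprojected directions.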

4.4  Shape  from  motion 

Just as the ideas about shape from shading and edge detection described in Sections 3.1 and 3.2 lead
naturally to progress on motion perception, so do the developments surrounding the primal sketch. The first
treatment of this issue is due to Ullman [ULLM78], who considered the problem of establishing a
correspondence between the primal sketches in two successive image frames. Ullman also studied the problem of
computing the structure of a rigid body from the correspondences of a small number of points in a number of
views. It turns out that remarkably few of each are required to compute rigid three-dimensional structure. In
modelling normal vision of course, sparsity of information is manifestly not the problem! A different way to
view such results is that they give information about how local an algorithm to determine three-dimensional
structure can be. More recently, Webb [WEBB80, WEBB81], Hoffman and Flinchbaugh [HOFF80], and
Rashid [RASH80] have considered the problem of reconstructing motion in depth from the output of the
correspondence computation. Flinchbaugh and Chandrasekaran [FLIN81] coin the term "dynamic primal
sketch" to describe the representation they compute, since it associates an image velocity measure with every
primal sketch element. Flinchbaugh and Chandrasekaran [FLIN81] have proposed a number of grouping
primitives to apply to the dynamic primal sketch, analogous to those discussed above for the (static) primal
sketch.

5.  Modules that operate on representations of surface shape

Many of the visual processes discussed in the previous sections compute the shape of a visible surface by
finding the local surface orientation everywhere within its boundaries. This includes the work of Horn and
his colleagues on shape from shading (Section 3.2), the computation of shape from contour investigated by
Witkin (section 4.2), and the interpretation of optical flow [PRAZ80, CLOC80]. On the other hand, shape
from stereo yields disparity only at the discrete set of zero crossings. A change of coordinates can convert
the angular disparities to depths, but to compute the local surface normal everywhere on the visible surface it
is necessary to interpolate a smooth surface from the discrete set of given points. We shall discuss this issue
below. Binocular stereo is not the only module that generates an incomplete surface orientation map. Shape
from texture (section 4.3) computations yield (constrained) surface orientations only at texture points, which
may be more or less densely distributed. Stevens [STEV81] considers the interpretation of surface contours,
and finds that they strongly constrain the perception of the underlying surface. Horn [HORN82] and Marr
[MARR78a] suggest that in addition to local surface orientation, it is advantageous to make explicit the
discontinuities in surface orientation and depth. It is not yet clear how surface normals should be parameterized, nor
how accurately their values should be represented. Moreover, substantial advantages are likely to accrue from
attaching texture and color descriptors to visible surfaces, but the details are as yet unclear.

One might also consider maintaining separate representations corresponding to the four (or more)
channels defined in the Marr-Hildreth theory of edge detection (described in Section 3.1.1 and used in the
Marr-Poggio theory of stereo). This would enable the visible surfaces in a scene to be represented at different scales.
It is clear that surface information needs to be made explicit at different levels of resolution: a pebbled path
may be considered approximately planar by a human who is walking along it. On the other hand, an ant
or person on roller skates may find the same path extremely difficult to navigate; in such cases the path is
unlikely to be perceived as planar. As this example indicates, the level of resolution of a representation is
determined largely by the process operating upon the representation, and there has been little investigation of
such processes to date. Hinton shows that different representations of the same volume and set of surfaces
can have a significant influence on the difficulty of perceptual tasks [HINT79]. Similarly, we have seen that
grouping processes play an important role at several stages of visual processing, from edge finding to the
interpretation of texture. Such processes have not yet been extensively investigated at the level of representations of
surface orientations.

Perhaps the most important operation performed by any vision system is recognition. Representations
below the level of surfaces are generally too unstructured to support recognition. One notable exception to this
is recognition of surface type from texture information. Interestingly, we suggested in section 3.4 that texture
is a form of surface representation. It has been argued that the surface orientation map is also inappropriate,
in essence because it is viewer centered. Marr [MARR78a] notes that we are capable of recognizing objects
from a wide variety of views, against a wide variety of backgrounds. To achieve this, he suggests a
representation which makes explicit the three dimensional ("volumetric") nature of objects. We shall consider such
representations in the next Section. For the moment we need only note that it is highly non-trivial to extract
volumetric representations from a surface based representation, and so practical advantages might accrue from
recognition based on the surface orientation map.


The case against surface based models of objects for recognition is essentially an argument against
multiple views. Horn [HORN82] notes that irrespective of the force of the argument as regards general human
vision, surface based models may still support important practical applications. For example, because of the
limitations imposed by methods of manufacture, many industrial parts only assume a small number of stable
configurations. Symmetry further reduces the number of substantially different views of a part. Since there are
typically only a small number of parts in a parts mix, one can store a representation computed from the surface
orientation map corresponding to each different view of a part in each configuration. Horn further suggests
that it may be sufficient to throw away positional information and model an object by the distribution of its
surface normals on the Gaussian sphere [HORN82]. Figure 54 illustrates the idea.
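
The idea of discarding position and keeping only the distribution of surface normals can be sketched for a
polyhedral object. The Python fragment below is a toy illustration under stated assumptions, not Horn's
formulation: the object is a hypothetical axis-aligned box described by parallelogram facets, and the
"distribution" is simply the total facet area recorded against each unit normal direction.

```python
import math
from collections import defaultdict

def cross(u, v):
    """Vector cross product u x v."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def box_facets(size, offset=(0.0, 0.0, 0.0)):
    """Six facets (origin, edge_u, edge_v) of a box, with edge_u x edge_v outward."""
    a, b, c = size
    ox, oy, oz = offset
    return [
        ((ox + a, oy, oz), (0, b, 0), (0, 0, c)),   # +x face
        ((ox, oy, oz), (0, 0, c), (0, b, 0)),       # -x face
        ((ox, oy + b, oz), (0, 0, c), (a, 0, 0)),   # +y face
        ((ox, oy, oz), (a, 0, 0), (0, 0, c)),       # -y face
        ((ox, oy, oz + c), (a, 0, 0), (0, b, 0)),   # +z face
        ((ox, oy, oz), (0, b, 0), (a, 0, 0)),       # -z face
    ]

def orientation_histogram(facets):
    """Map each unit normal direction to the total facet area facing that way.

    The facet origin is ignored, so the result does not depend on position.
    """
    hist = defaultdict(float)
    for _origin, u, v in facets:
        n = cross(u, v)
        area = math.sqrt(sum(x * x for x in n))
        key = tuple(round(x / area, 6) + 0.0 for x in n)  # "+ 0.0" avoids -0.0 keys
        hist[key] += area
    return dict(hist)

h_origin = orientation_histogram(box_facets((2.0, 1.0, 1.0)))
h_moved = orientation_histogram(box_facets((2.0, 1.0, 1.0), offset=(5.0, -3.0, 7.0)))
```

The two histograms are identical because translation leaves the normals and areas unchanged, whereas a
rotation of the object would move the entries around the sphere.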

Perhaps  the  most  difficult  problem  which  sighted  people  constantly  rely  on  their  vision  systems  to  help 
them  to  solve  is  the  perception  or  planning  of  movements  through  cluttered  space.  The  experience  of 
programming  robots  to  avoid  obstacles  and  discover  a  satisfactory  trajectory  between  two  positions  reveals 
the  staggering  difficulty  of  the  geometric  problems  involved,  problems  which  the  human  visual  system  solves 
effortlessly.  Space,  considered  as  an  object,  typically  occupies  a  volume  and  consists  of  a  surface  whose 
descriptions  push  current  representational  frameworks  to  their  limits,  if  not  far  beyond  them.  A  solid  start  has 
been made on the problems of spatial planning by Lozano-Perez [LOZA81], who represents the set of possible
configurations which an object can assume in the presence of obstacles and presents efficient algorithms for
computing near optimal trajectories. A further important application lies in making precise the rather vague
notion of cognitive map. It is usually supposed [LYNC60] that this only refers to object representations.
Actually it seems that we have quite considerable navigational processes which operate on the surface
orientation map.

We conclude this section with a discussion of the problem of interpolating a smooth surface from a
discrete set of points, such as the disparity map computed by Grimson's implementation of the Marr-Poggio
theory of stereo (section 4.1). One approach might be to apply the work on Coons patches, Bezier surfaces,
and Ferguson surfaces developed for work in computer aided design (CAD) and computer aided manufacture
(CAM) [FAUX79]. It is however worth asking whether the interpolated surface can be constrained by what we
know about human vision, by isolating constraints that have perhaps not figured largely in the development of
CAD/CAM. Essentially, two such constraints have been uncovered, and are currently receiving attention.

The first was introduced by Grimson [GRIM81]. Suppose that $D_{actual}$ is the disparity map from which
we are to interpolate a smooth surface $S$. Horn's work on image formation tells us how to construct the image
$Im(S)$, and this enables us to compute the set of zero crossings, and hence predict a disparity map $D_{predict}$.
The actual and predicted disparity maps should agree everywhere. Actually, one does not explicitly construct
the image of the interpolated surface and the predicted disparity map. Rather, it is used implicitly in deriving
a number of theorems which constrain the surface $S$. Grimson has coined a suggestive slogan for this analysis:
no information is information, since the absence of an initial value at the point $(x, y)$ in the actual disparity map
means that the gradient of the interpolated surface $S$ cannot change too rapidly there.

The second constraint is based on the idea that the human visual system constructs the most conservative
solution consistent with the data. Figure 55 is reproduced from [BARR81b], and shows a set of possible space
curves, all of which produce an elliptical image. Significantly, we are unaware of most such possibilities,
especially those that are discontinuous. We are able to interpolate smooth curves and surfaces without involving
rich semantics. It also seems that the shape of the boundary plays the most significant role in determining
the interpolated surface (see for example figure 56, which is reproduced from [BARR81b]). Taken together,
these ideas suggest that the interpolation process can be modelled in terms of the calculus of variations (see for
example [COUR37, volume I]).

The idea is to choose an appropriate "performance index" $P$ and define the interpolated surface to be
that which minimizes the integral of $P$ subject to the boundary constraints. This idea has been explored by
a number of authors. Unlike the ordinary differential calculus, it is not generally the case that a minimal
surface exists, even for "plausible" performance indices. For example, it is not clear that there is a unique
surface that minimizes the integral of the Gaussian curvature. Grimson [GRIM81] notes that the existence of
a minimizing surface can be formally guaranteed if the performance index satisfies the technical condition of
being a seminorm. He suggests the quadratic variation, which is defined to be $f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2$, and shows
how to construct the iteration operator shown in figure 57. The square Laplacian $(f_{xx} + f_{yy})^2$ also satisfies the
seminorm condition. Brady and Horn [BRAD81b] show that any quadratic form in the second derivatives $f_{xx}$,
$f_{xy}$, and $f_{yy}$ is a seminorm and leads to a unique minimal surface. They further show that the rotationally
symmetric performance indices form a vector space spanned by the quadratic variation and the square Laplacian.
Since both operators satisfy the same Euler equation $\Delta^2 f = 0$, they cannot be distinguished away from given
boundary points. Brady and Horn apply the statics of a thin plate to show that the quadratic variation provides
the tighter constraint. Grimson notes that the null space of the square Laplacian is larger than that of the
quadratic variation, containing for example the function $f(x, y) = xy$ [GRIM81]. He has worked out several
examples showing that the quadratic variation leads to surfaces that accord better with human intuition. Brady
and Grimson (forthcoming) use these ideas about surface interpolation to propose that subjective contours
arise from surface perception.
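
The difference between the two performance indices can be seen by evaluating their integrands pointwise. The
Python sketch below is an illustration, not Grimson's interpolation operator: it estimates the second
derivatives by central differences (the step size and function names are assumptions) and compares the
quadratic variation with the square Laplacian for a few simple surfaces.

```python
def second_derivatives(f, x, y, h=1e-2):
    """Central-difference estimates of (f_xx, f_xy, f_yy) at the point (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def quadratic_variation(f, x, y):
    """Integrand f_xx^2 + 2 f_xy^2 + f_yy^2."""
    fxx, fxy, fyy = second_derivatives(f, x, y)
    return fxx ** 2 + 2.0 * fxy ** 2 + fyy ** 2

def square_laplacian(f, x, y):
    """Integrand (f_xx + f_yy)^2."""
    fxx, _fxy, fyy = second_derivatives(f, x, y)
    return (fxx + fyy) ** 2

def saddle(x, y):
    return x * y

def plane(x, y):
    return 3.0 * x - 2.0 * y + 1.0

def bowl(x, y):
    return x * x + y * y
```

For the saddle f(x, y) = xy the square Laplacian vanishes while the quadratic variation does not, so the
saddle is penalized only by the quadratic variation. Both integrands vanish for a plane, and both are nonzero
for the bowl.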

Barrow and Tenenbaum [BARR81b] observe that in order to interpolate the circular cross section of a
cylinder and sphere it is sufficient to assume that the curvature varies linearly in the image. They suggest that
in general one should choose a linear expression for the curvature to minimize the least squares error. Brady,
Grimson, and Langridge [BRAD80b] use an approximation to the one dimensional quadratic variation $f_{xx}^2$ to
argue that subjective contours are cubics. The exact minimal integral curvature curve has recently been found
by Horn [HORN81b].


6.  Viewpoint  independent  representations  of  objects 

The  surface  based  representations  discussed  in  the  previous  section  arc  different  for  each  particular  view¬ 
point.  I'ach  viewpoint  of  c.ich  viewer  in  a  scene  defines  a  coordinate  frame  in  terms  of  which  the  points  that 
arc  visible  from  that  viewpoint  can  be  described.  Other  coordinate  frames  arc  naturally  associated  with  the 
objects  and  surfaces  in  a  scene,  ami  it  is  often  more  convenient  to  describe  relative  positions  and  movements 
hi  those  frames  lather  than  in  the  ones  lined  up  with  a  particular  viewpoint.  In  many  scenes  there  is  a  naimal 
"t'loh.il"  io.iidm.iie  frame  that  is  independent  of  any  viewpoint,  lor  example,  an  ail  plane  or  ship  lias  an 


Fijjwc  55.  An  clli|Hknl  inutgc,  and  some  of  (he  space  turves  thnl  inighl  have  generat'd  H. 
(Reproduced  fi(Hi)  IHAK.RUTb,  figure  3*2] 
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Figure  57.  The  surface  interpolation  operator  derived  by  Crimson  from  minimizing  quadratic 
variation. 

associated  frame  defined  by  its  bow,  stern,  starboard,  port,  up,  and  down;  rotations  about  those  axes  specify 
the  yaw,  roll,  and  pitch.  A  football  field  or  a  room  has  a  natural  frame  defined  by  the  sidelines  or  walls  and  by 
the  gravitational  vertical. 

Points can be represented in homogeneous coordinates, for example, and frame transformations by 4x4
matrices that consist of a translation, a rotation, and a scale factor. This approach has proved valuable in
computer graphics [CARL78] and robotics [PAUL79]. Rotations can also be described as quaternions with a
saving of storage [TAYL79, BROO80]. Frames can specify the transformation to scene coordinates, and hence
by composition relate different viewpoints. Brooks and Binford [BROO80] note that one important use of
inter-relating frames by composition is to make affixment relations explicit. For example, the coordinate
frame local to an airplane needs to be related to that defined by the runway on which it stands. The
programming language AL [FINK74] was the first to provide a mechanism for the automatic maintenance of
affixment relations.
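
The composition of frames described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the function names (make_frame, compose, apply), the uniform scale factor, and the runway and airplane placements are all assumptions chosen for illustration.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z (vertical) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def make_frame(rotation, translation, scale=1.0):
    """4x4 homogeneous matrix from a 3x3 rotation, a translation, and a uniform scale."""
    m = [[scale * rotation[i][j] for j in range(3)] + [translation[i]] for i in range(3)]
    m.append([0.0, 0.0, 0.0, 1.0])
    return m

def compose(a, b):
    """Matrix product a*b: apply frame b first, then frame a."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def apply(m, point):
    """Map a 3D point through a homogeneous transformation."""
    h = [point[0], point[1], point[2], 1.0]
    out = [sum(m[i][k] * h[k] for k in range(4)) for i in range(4)]
    return [out[0] / out[3], out[1] / out[3], out[2] / out[3]]

# Hypothetical placements: the runway is rotated 90 degrees and offset in the scene,
# and the airplane is affixed to the runway 10 units along the runway's x axis.
scene_from_runway = make_frame(rot_z(math.pi / 2), [100.0, 0.0, 0.0])
runway_from_plane = make_frame(rot_z(0.0), [10.0, 0.0, 0.0])

# Composition makes the affixment explicit: moving the runway frame moves the airplane too.
scene_from_plane = compose(scene_from_runway, runway_from_plane)
nose_in_scene = apply(scene_from_plane, [1.0, 0.0, 0.0])
```

Because the airplane is stored relative to the runway, only scene_from_runway has to change if the runway estimate is revised; the airplane's scene position follows by recomposition.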

Most objects are composed of connected parts, each of which can be described in its own local frame. A
person has two arms, each of which is further subdivided into an upper arm, a forearm, and a hand. As with any
structured representation, the important issues concern the choice of "primitives" and the means by which one
part of a representation is related to another. Consider the latter issue first. Work in robotics has adopted
the Hartenberg-Denavit notation for kinematic chains to describe the geometric inter-relationships between
successive links of an arm, a leg, or the several legs of a mobile robot [PAUL79]. Marr and Nishihara's
suggestion [MARR78b] is a special case of this notation.
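
As a rough illustration of how such a notation relates successive links, the sketch below chains one 4x4 matrix per link. It assumes the commonly used parameter convention (joint angle theta, link offset d, link length a, link twist alpha, applied as a rotation and translation about z followed by a translation and rotation about x); the two-link planar arm and its dimensions are hypothetical.

```python
import math

def dh_matrix(theta, d, a, alpha):
    """Link transform: Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def forward_kinematics(links):
    """Compose the link transforms from the base (proximal) link to the tip (distal) link."""
    t = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, d, a, alpha in links:
        t = matmul(t, dh_matrix(theta, d, a, alpha))
    return t

# Hypothetical planar arm: two links of unit length, joint angles +90 and -90 degrees.
arm = [(math.pi / 2, 0.0, 1.0, 0.0), (-math.pi / 2, 0.0, 1.0, 0.0)]
tip = forward_kinematics(arm)
tip_position = (tip[0][3], tip[1][3], tip[2][3])
```

Each link is described only relative to its predecessor, so the same chain serves for an arm, a leg, or any other articulated part.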

One  approach  to  primitives  is  to  consider  objects  to  be  composed  of  instances  of  a  small  set  of  prototype 
volumes, such as spheres, blocks, and triangular prisms [BRAI73]. This approach has been much used in
CAD/CAM. The problem is that even simple objects have a complex description. One might add more
and  more  primitives,  such  as  truncated  cones  and  pyramids,  to  reduce  this  complexity.  Binford  [BINF71] 
suggested  another  approach  that  has  proved  very  fruitful.  He  introduced  a  more  general  class  of  volumes 
called generalized cones, which includes as subclasses the primitive volumes mentioned previously. A
generalized cone describes a volume by sweeping a cross section area along a space curve, called the "spine", while
deforming it according to some sweeping rule. Figure 58 is reproduced from [BROO81] and shows a number
of generalized cones. Notice that although elongation is the characteristic property of generalized cones, they
are not necessarily elongated. Nor do they require a circular cross section. Nevertheless, generalized cones
are particularly well suited to describing objects which have a natural axis. This certainly includes growth
structures. Hollerbach [HOLL75] noted that Greek amphorae are also well described by generalized cones, the
spine being a result of the process of manufacture on the potter's wheel. Similar considerations apply to objects
turned on a lathe or produced by extrusion. Conversely, objects produced by moulding, beating, welding, or
sculpture tend to be awkwardly described in terms of generalized cones.
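
The three ingredients of a generalized cone (spine, cross section, sweeping rule) can be made concrete with a small sketch. It is restricted, as an assumption, to a straight spine along the z axis and to a sweeping rule that only scales the cross section; the function names and the sample shapes are illustrative and do not come from [BINF71] or [BROO81].

```python
import math

def circle(n):
    """Unit circular cross section sampled at n boundary points."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

def square():
    """Square cross section with half-width 1 (corner points only)."""
    return [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]

def generalized_cone(cross_section, sweep_rule, length, steps):
    """Sample rings of surface points for a straight-spine generalized cone.

    cross_section: list of (x, y) boundary points in the plane normal to the spine
    sweep_rule:    function of t in [0, 1] giving the scale of the cross section
    length:        length of the straight spine, taken along the z axis
    """
    rings = []
    for i in range(steps + 1):
        t = i / steps
        s = sweep_rule(t)
        z = t * length
        rings.append([(s * x, s * y, z) for (x, y) in cross_section])
    return rings

# The prototype volumes fall out as special cases of the same description.
cylinder = generalized_cone(circle(8), lambda t: 1.0, 4.0, 4)
truncated_cone = generalized_cone(circle(8), lambda t: 2.0 - t, 4.0, 4)
block = generalized_cone(square(), lambda t: 1.0, 2.0, 1)
pyramid = generalized_cone(square(), lambda t: 1.0 - t, 2.0, 4)
```

A curved spine would additionally need a moving coordinate frame along the curve to orient each cross section, which this sketch omits.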

A major issue in description and recognition arises from the vast number of objects that we can
distinguish. This leads to an enormous data base of models and makes the indexing process of crucial importance.
The  problem  is  ubiquitous  in  artificial  intelligence  and  has  produced  a  number  of  schemes  for  matching  on 
the basis of partial descriptions. One recurrent theme is the use of abstraction to produce a smaller search
space, the solution being used to guide further search in a less abstracted version. At a suitably high level of
abstraction this can be recognized as the process which underlies the matcher in the Marr-Poggio theory of
stereo described in Section 4.1. In the specific case of vision, Nevatia and Binford [NEVA77] and Marr and
Nishihara [MARR78b] discuss various schemes for indexing. Agin [AGIN72], Nevatia and Binford [NEVA77],
and Marr and Nishihara [MARR78b] note that a kinematic linkage can generally be approximated by a single
cone. Such approximate descriptions provide for hierarchical descriptions at a useful variety of scales. Often,
the most useful approximation is based on the most proximal link, more detailed descriptions deriving from
applying the same process to the distal links of the chain. Brooks and Binford [BROO80] use subcategories of
objects to achieve property inheritance and facilitate indexing. For example, they exploit the fact that a Boeing
747-SP is a special kind of Boeing 747 (with slight variations pertinent to recognizing one), and a Boeing 747 is
a special kind of wide-bodied jet (distinguished from other aircraft such as Boeing 727s on the basis of overall
length and width-to-length ratio).
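
Property inheritance through subcategories can be sketched with a simple parent-linked structure. This is only an illustration of the idea, not the representation used in [BROO80]; the class name ModelClass and the property names and values are placeholders, not measured data.

```python
class ModelClass:
    """A node in a subcategory hierarchy; properties not stored locally are inherited."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent
        self.properties = properties

    def lookup(self, key):
        """Return the most specific value for key, searching up through the parents."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        raise KeyError(key)

    def lineage(self):
        """List this class and its ancestors, most specific first."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

# Placeholder descriptions: only the differences are stored at each level.
wide_body = ModelClass("wide-bodied jet", body_class="wide")
b747 = ModelClass("Boeing 747", parent=wide_body, fuselage="standard")
b747sp = ModelClass("Boeing 747-SP", parent=b747, fuselage="shortened")

names = [m.name for m in b747sp.lineage()]
```

Indexing can then proceed from the general class to its subcategories, testing only the few properties that distinguish each level.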

Brooks and Binford [BROO80, BROO81] draw attention to the need to incorporate constraints into
object descriptions. For example, a person has two legs which are of (roughly) the same length, and are roughly
as long as the person's body. The actual sizes scale with (a priori unknown) camera position. As usual,
constraints propagate. For example, the engine pods of a jet are deployed symmetrically on the front wings on
either side of the fuselage. Finding an aircraft wing constrains the overall scale of the aircraft, and hence the
length of the fuselage. Such constraints are represented naturally by numerical inequalities. Brooks [BROO81]
describes a program that determines the solutions of a set of such inequalities. If an object recognized as a
person's body is much larger than one thought to be a tree, then the person is probably much nearer than the
tree. Mechanisms for taking into account relatively remote possibilities such as giants and toy trees have been
proposed (for example, [ANDE81]).
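
The way a found wing constrains the scale, and hence the fuselage, can be sketched with simple interval arithmetic. This is a deliberately simplified stand-in and not the program described in [BROO81]; the model size ranges and the image measurements are hypothetical.

```python
def intersect(a, b):
    """Combine two interval constraints; fail if they are inconsistent."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("constraints are inconsistent")
    return (lo, hi)

def scale_from_part(image_size, model_range):
    """image_size = scale * model_size with model_size in [lo, hi] bounds the scale."""
    lo, hi = model_range
    return (image_size / hi, image_size / lo)

def predict_image_size(scale_range, model_range):
    """Bound the image size of another part given the scale interval."""
    return (scale_range[0] * model_range[0], scale_range[1] * model_range[1])

# Hypothetical model inequalities (model units): 20 <= wing <= 25, 60 <= fuselage <= 70.
wing_model = (20.0, 25.0)
fuselage_model = (60.0, 70.0)

# A wing measured at 50 image units bounds the unknown scale ...
scale = scale_from_part(50.0, wing_model)
# ... which in turn bounds where, and how long, the fuselage should appear.
fuselage_image = predict_image_size(scale, fuselage_model)

# A second measurement tightens the scale; a hypothetical engine pod of 9 image units.
pod_model = (4.0, 5.0)
scale_refined = intersect(scale, scale_from_part(9.0, pod_model))
```

An empty intersection signals that the hypothesized interpretation violates the model constraints and can be rejected.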

Finally, we consider the process of extracting from an image the spine, cross section function, and
sweeping rule which define a generalized cone. The work on this problem to date requires a number of simplifying
assumptions. For example, Nevatia and Binford implicitly assume that the cross section function is circular
[NEVA77]. Marr [MARR77] considered the problem in considerable detail and showed how, in a restricted
case, a straight spine can be extracted from the inflection points on the bounding contour of an object. Brady
showed that the spine can be extracted more reliably by using stationary points of curvature [BRAD79b].
Marr's work assumes that the bounding contour is planar, which is overly restrictive [BRUS81]. He also
proposed a classification of the images of the joins between two straight spine cones.